Preface


If you are a scientist, an analyst, a consultant, or anybody else who has to prepare 
technical documents or reports, one of the most important skills you need to have is 
the ability to make compelling data visualizations, generally in the form of figures. 
Figures will typically carry the weight of your arguments. They need to be clear, 
attractive, and convincing. The difference between good and bad figures can be the 
difference between a highly influential or an obscure paper, a grant or contract won 
or lost, a job interview gone well or poorly. And yet, there are surprisingly few resources
to teach you how to make compelling data visualizations. Few colleges offer courses
on this topic, and there are not that many books on this topic either. (Some exist,
of course.) Tutorials for plotting software typically focus on how to achieve specific 
visual effects rather than explaining why certain choices are preferred and others not. 
In your day-to-day work, you are simply expected to know how to make good figures, 
and if you’re lucky you have a patient adviser who teaches you a few tricks as you’re 
writing your first scientific papers. 

In the context of writing, experienced editors talk about “ear,” the ability to hear 
(internally, as you read a piece of prose) whether the writing is any good. I think that 
when it comes to figures and other visualizations, we similarly need “eye,” the ability 
to look at a figure and see whether it is balanced, clear, and compelling. And just as is 
the case with writing, the ability to see whether a figure works or not can be learned. 
Having eye means primarily that you are aware of a larger collection of simple rules 
and principles of good visualization, and that you pay attention to little details that 
other people might not. 

In my experience, again just as in writing, you don’t develop eye by reading a book 
over the weekend. It is a lifelong process, and concepts that are too complex or too 
subtle for you today may make much more sense five years from now. I can say for 
myself that I continue to evolve in my understanding of figure preparation. I routinely
try to expose myself to new approaches, and I pay attention to the visual and
design choices others make in their figures. I’m also open to changing my mind. I
might today consider a given figure great, but next month I might find a reason to
criticize it. So with this in mind, please don’t take anything I say as gospel. Think critically
about my reasoning for certain choices and decide whether you want to adopt
them or not.

While the materials in this book are presented in a logical progression, most chapters 
can stand on their own, and there is no need to read the book cover to cover. Feel free 
to skip around, to pick out a specific section that you’re interested in at the moment, 
or one that covers a particular design choice you’re pondering. In fact, I think you 
will get the most out of this book if you don’t read it all at once, but rather read it 
piecemeal over longer stretches of time, try to apply just a few concepts from the 
book in your figuremaking, and come back to read about other concepts or reread 
sections on concepts you learned about a while back. You may find that the same 
chapter tells you different things if you reread it after a few months have passed. 

Even though nearly all of the figures in this book were made with R and ggplot2, I do
not see this as an R book. I am talking about general principles of figure preparation. 
The software used to make the figures is incidental. You can use any plotting software 
you want to generate the kinds of figures I’m showing here. However, ggplot2 and 
similar packages make many of the techniques I’m using much simpler than other 
plotting libraries. Importantly, because this is not an R book, I do not discuss code or 
programming techniques anywhere in this book. I want you to focus on the concepts 
and the figures, not on the code. If you are curious about how any of the figures were 
made, you can check out the book’s source code at its GitHub repository. 

Thoughts on Graphing Software and Figure-Preparation Pipelines

I have over two decades of experience preparing figures for scientific publications and 
have made thousands of figures. If there has been one constant over these two decades,
it’s been the change in figure preparation pipelines. Every few years, a new plotting
library is developed or a new paradigm arises, and large groups of scientists
switch over to the hot new toolkit. I have made figures using gnuplot, Xfig, Mathematica,
Matlab, matplotlib in Python, base R, ggplot2 in R, and possibly others I can’t
currently remember. My current preferred approach is ggplot2 in R, but I don’t expect 
that I’ll continue using it until I retire. 

This constant change in software platforms is one of the key reasons why this book is 
not a programming book and why I have left out all code examples. I want this book 
to be useful to you regardless of which software you use, and I want it to remain valuable
even once everybody has moved on from ggplot2 and is using the next new
thing. I realize that this choice may be frustrating to some ggplot2 users who would 
like to know how I made a given figure. However, anybody who is curious about my 
coding techniques can read the source code of the book, which is publicly available. Also, in the
future I may release a supplementary document focused just on the code. 

One thing I have learned over the years is that automation is your friend. I think figures
should be autogenerated as part of the data analysis pipeline (which should also
be automated), and they should come out of the pipeline ready to be sent to the 
printer, with no manual post-processing needed. I see a lot of trainees autogenerate 
rough drafts of their figures, which they then import into Illustrator for sprucing up. 
There are several reasons why this is a bad idea. First, the moment you manually edit 
a figure, your final figure becomes irreproducible. A third party cannot generate the 
exact same figure you did. While this may not matter much if all you did was change 
the font of the axis labels, the lines are blurry, and it’s easy to cross over into territory 
where things are less clear-cut. As an example, let’s say you want to manually replace 
cryptic labels with more readable ones. A third party may not be able to verify that 
the label replacement was appropriate. Second, if you add a lot of manual post-processing
to your figure-preparation pipeline, then you will be more reluctant to
make any changes or redo your work. Thus, you may ignore reasonable requests for 
change made by collaborators or colleagues, or you may be tempted to reuse an old 
figure even though you’ve actually regenerated all the data. Third, you may yourself 
forget what exactly you did to prepare a given figure, or you may not be able to generate
a future figure on new data that exactly visually matches your earlier figure. These
are not made-up examples. I’ve seen all of them play out with real people and real 
publications. 

For all these reasons, interactive plot programs are a bad idea. They inherently force 
you to manually prepare your figures. In fact, it’s probably better to autogenerate a 
figure draft and spruce it up in Illustrator than to make the entire figure by hand in 
some interactive plot program. Please be aware that Excel is an interactive plot program
as well and is not recommended for figure preparation (or data analysis).

One critical component in a book on data visualization is the feasibility of the proposed
visualizations. It’s nice to invent some elegant new type of visualization, but if
nobody can easily generate figures using this visualization then there isn’t much use 
to it. For example, when Tufte first proposed sparklines nobody had an easy way of 
making them. While we need visionaries who move the world forward by pushing 
the envelope of what’s possible, I envision this book to be practical and directly applicable
to working data scientists preparing figures for their publications. Therefore,
the visualizations I propose in the subsequent chapters can be generated with a few 
lines of R code via ggplot2 and readily available extension packages. In fact, nearly 
every figure in this book, with the exception of a few figures in Chapters 26, 27, and 
28, was autogenerated exactly as shown. 
Conventions Used in This Book

The following typographical conventions are used in this book: 

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used to refer to program elements such as variable or function names, statements,
and keywords.





Using Code Examples 

Supplemental material is available for download at https://github.com/clauswilke/dataviz.

This book is here to help you get your job done. In general, if example code is offered 
with this book, you may use it in your programs and documentation. You do not 
need to contact us for permission unless you’re reproducing a significant portion of 
the code. For example, writing a program that uses several chunks of code from this 
book does not require permission. Selling or distributing a CD-ROM of examples 
from O’Reilly books does require permission. Answering a question by citing this 
book and quoting example code does not require permission. Incorporating a significant
amount of example code from this book into your product’s documentation does
require permission. 
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “Fundamentals of Data Visualization 
by Claus O. Wilke (O’Reilly). Copyright 2019 Claus O. Wilke, 978-1-492-03108-6.” 

You may find that additional uses fall within the scope of fair use (for example, reusing
a few figures from the book). If you feel your use of code examples or other content
falls outside fair use or the permission given above, feel free to contact us at
permissions@oreilly.com. 

O'Reilly Online Learning 




For almost 40 years, O’Reilly Media has provided technology
and business training, knowledge, and insight to help companies
succeed.


Our unique network of experts and innovators share their knowledge and expertise 
through books, articles, conferences, and our online learning platform. O’Reilly’s
online learning platform gives you on-demand access to live training courses, in-depth
learning paths, interactive coding environments, and a vast collection of text
and video from O’Reilly and 200+ other publishers. For more information, please 
visit http://oreilly.com. 

How to Contact Us 

Please address comments and questions concerning this book to the publisher: 

O’Reilly Media, Inc. 

1005 Gravenstein Highway North 
Sebastopol, CA 95472 

800-998-9938 (in the United States or Canada) 

707-829-0515 (international or local) 

707-829-0104 (fax) 

We have a web page for this book, where we list errata, examples, and any additional 
information. You can access this page at http://bit.ly/fundamentals-of-data-visualization.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
CHAPTER 1


Introduction 


Data visualization is part art and part science. The challenge is to get the art right 
without getting the science wrong, and vice versa. A data visualization first and foremost
has to accurately convey the data. It must not mislead or distort. If one number
is twice as large as another, but in the visualization they look to be about the same,
then the visualization is wrong. At the same time, a data visualization should be aesthetically
pleasing. Good visual presentations tend to enhance the message of the visualization.
If a figure contains jarring colors, imbalanced visual elements, or other
features that distract, then the viewer will find it harder to inspect the figure and 
interpret it correctly. 

In my experience, scientists frequently (though not always!) know how to visualize 
data without being grossly misleading. However, they may not have a well-developed 
sense of visual aesthetics, and they may inadvertently make visual choices that detract 
from their desired message. Designers, on the other hand, may prepare visualizations 
that look beautiful but play fast and loose with the data. It is my goal to provide useful 
information to both groups. 

This book attempts to cover the key principles, methods, and concepts required to 
visualize data for publications, reports, or presentations. Because data visualization is 
a vast field, and in its broadest definition could include topics as varied as schematic 
technical drawings, 3D animations, and user interfaces, I necessarily had to limit my 
scope. I am specifically covering the case of static visualizations presented in print, 
online, or as slides. The book does not cover interactive visuals or movies, except in 
one brief section in Chapter 16. Therefore, throughout this book, I will use the words 
“visualization” and “figure” somewhat interchangeably. The book also does not provide
any instruction on how to make figures with existing visualization software or
programming libraries. The annotated bibliography at the end of the book includes 
pointers to appropriate texts covering these topics. 
The book is divided into three parts. The first, “From Data to Visualization,” describes
different types of plots and charts, such as bar graphs, scatterplots, and pie charts. Its
primary emphasis is on the science of visualization. In this part, rather than attempting
to provide encyclopedic coverage of every conceivable visualization approach, I
discuss a core set of visuals that you will likely encounter in publications and/or need
in your own work. In organizing this part, I have attempted to group visualizations by
the type of message they convey rather than by the type of data being visualized. Statistical
texts often describe data analysis and visualization by type of data, organizing
the material by number and type of variables (one continuous variable, one discrete 
variable, two continuous variables, one continuous and one discrete variable, etc.). I 
believe that only statisticians find this organization helpful. Most other people think 
in terms of a message, such as how large something is, how it is composed of parts, 
how it relates to something else, and so on. 

The second part, “Principles of Figure Design,” discusses various design issues that 
arise when assembling data visualizations. Its primary but not exclusive emphasis is 
on the aesthetic aspect of data visualization. Once we have chosen the appropriate 
type of plot or chart for our dataset, we have to make aesthetic choices about the visual
elements, such as colors, symbols, and font sizes. These choices can affect both
how clear a visualization is and how elegant it looks. The chapters in this second part 
address the most common issues that I have seen arise repeatedly in practical 
applications. 

The third part, “Miscellaneous Topics,” covers a few remaining issues that didn’t fit 
into the first two parts. It discusses file formats commonly used to store images and 
plots, provides thoughts about the choice of visualization software, and explains how 
to place individual figures into the context of a larger document. 

Ugly, Bad, and Wrong Figures 

Throughout this book, I frequently show different versions of the same figures, some 
as examples of how to make a good visualization and some as examples of how not to. 
To provide a simple visual guideline of which examples should be emulated and 
which should be avoided, I am labeling problematic figures as “ugly,” “bad,” or 
“wrong” (Figure 1-1): 

Ugly 

A figure that has aesthetic problems but otherwise is clear and informative 

Bad 

A figure that has problems related to perception; it may be unclear, confusing, 
overly complicated, or deceiving 

Wrong 

A figure that has problems related to mathematics; it is objectively incorrect 
Figure 1-1. Examples of ugly, bad, and wrong figures. (a) A bar plot showing three values
(A = 3, B = 5, and C = 4). This is a reasonable visualization with no major flaws. (b)
An ugly version of part (a). While the plot is technically correct, it is not aesthetically
pleasing. The colors are too bright and not useful. The background grid is too prominent.
The text is displayed using three different fonts in three different sizes. (c) A bad version
of part (a). Each bar is shown with its own y axis scale. Because the scales don’t align,
this makes the figure misleading. One can easily get the impression that the three values
are closer together than they actually are. (d) A wrong version of part (a). Without an
explicit y axis scale, the numbers represented by the bars cannot be ascertained. The bars
appear to be of lengths 1, 3, and 2, even though the values displayed are meant to be 3,
5, and 4.
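The mismatch described in part (d) comes down to simple arithmetic. The following is a minimal Python sketch, not taken from the book's source code. The function name `bar_lengths` and the hidden baseline of 2 are assumptions, chosen only because they reproduce the apparent lengths 1, 3, and 2 from the values 3, 5, and 4.

```python
def bar_lengths(values, baseline=0.0):
    """Return the drawn length of each bar when all bars start at `baseline`."""
    return [v - baseline for v in values]


values = [3, 5, 4]  # A, B, and C from Figure 1-1

honest = bar_lengths(values)                  # bars start at zero
truncated = bar_lengths(values, baseline=2)   # hypothetical hidden baseline

# Ratio of the longest to the shortest bar under each choice.
honest_ratio = max(honest) / min(honest)            # 5 / 3
truncated_ratio = max(truncated) / min(truncated)   # 3 / 1
```

With bars starting at zero, bar lengths are proportional to the values. With the assumed baseline, the longest bar looks three times as long as the shortest even though the underlying values differ by a factor of 5/3.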



I am not explicitly labeling good figures. Any figure that isn’t labeled as flawed should 
be assumed to be at least acceptable. It is a figure that is informative, looks appealing, 
and could be printed as is. Note that among the good figures, there will still be differences
in quality, and some good figures will be better than others.

I generally provide my rationale for specific ratings, but some are a matter of taste. In 
general, the “ugly” rating is more subjective than the “bad” or “wrong” rating. Moreover,
the boundary between “ugly” and “bad” is somewhat fluid. Sometimes poor
design choices can interfere with human perception to the point where a “bad” rating 
is more appropriate than an “ugly” rating. In any case, I encourage you to develop 
your own eye and to critically evaluate my choices. 
PART I


From Data to Visualization 




CHAPTER 2 


Visualizing Data: Mapping Data onto Aesthetics


Whenever we visualize data, we take data values and convert them in a systematic 
and logical way into the visual elements that make up the final graphic. Even though 
there are many different types of data visualizations, and at first glance a scatterplot,
a pie chart, and a heatmap don’t seem to have much in common, all these visualizations
can be described with a common language that captures how data values are
turned into blobs of ink on paper or colored pixels on a screen. The key insight is the 
following: all data visualizations map data values into quantifiable features of the 
resulting graphic. We refer to these features as aesthetics. 

Aesthetics and Types of Data 

Aesthetics describe every aspect of a given graphical element. A few examples are 
provided in Figure 2-1. A critical component of every graphical element is of course 
its position, which describes where the element is located. In standard 2D graphics, 
we describe positions by an x and y value, but other coordinate systems and one- or 
three-dimensional visualizations are possible. Next, all graphical elements have a 
shape, a size, and a color. Even if we are preparing a black-and-white drawing, graphical
elements need to have a color to be visible: for example, black if the background is
white or white if the background is black. Finally, to the extent we are using lines to 
visualize data, these lines may have different widths or dash-dot patterns. Beyond the 
examples shown in Figure 2-1, there are many other aesthetics we may encounter in a 
data visualization. For example, if we want to display text, we may have to specify font 
family, font face, and font size, and if graphical objects overlap, we may have to specify
whether they are partially transparent.
Figure 2-1. Commonly used aesthetics in data visualization: position, shape, size, color,
line width, line type. Some of these aesthetics can represent both continuous and discrete 
data (position, size, line width, color), while others can usually only represent discrete 
data (shape, line type). 

All aesthetics fall into one of two groups: those that can represent continuous data 
and those that cannot. Continuous data values are values for which arbitrarily fine 
intermediates exist. For example, time duration is a continuous value. Between any 
two durations, say 50 seconds and 51 seconds, there are arbitrarily many intermediates,
such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and so on. By contrast,
number of persons in a room is a discrete value. A room can hold 5 persons or 6, but
not 5.5. For the examples in Figure 2-1, position, size, color, and line width can represent
continuous data, but shape and line type can usually only represent discrete data.
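The distinction can be made concrete with a small Python sketch. It is an illustration only: the names `size_scale` and `shape_scale`, the point sizes, and the shape names are all assumptions and do not come from Figure 2-1.

```python
def size_scale(duration, domain=(50.0, 51.0), sizes=(2.0, 6.0)):
    """Linearly interpolate a point size for a duration given in seconds."""
    (d0, d1), (s0, s1) = domain, sizes
    return s0 + (duration - d0) / (d1 - d0) * (s1 - s0)


# A shape scale is a lookup table: it only knows the values it lists.
shape_scale = {5: "circle", 6: "square"}  # hypothetical shapes

midpoint_size = size_scale(50.5)          # an intermediate size exists
midpoint_shape = shape_scale.get(5.5)     # no shape lies "between" two shapes
```

Size can follow any intermediate duration because sizes can be interpolated, whereas the shape lookup has no entry for 5.5 and returns `None`.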

Next we’ll consider the types of data we may want to represent in our visualization. 
You may think of data as numbers, but numerical values are only two out of several 
types of data we may encounter. In addition to continuous and discrete numerical 
values, data can come in the form of discrete categories, in the form of dates or times, 
and as text (Table 2-1). When data is numerical we also call it quantitative and when 
it is categorical we call it qualitative. Variables holding qualitative data are factors, and 
the different categories are called levels. The levels of a factor are most commonly 
without order (as in the example of dog, cat, fish in Table 2-1), but factors can also be 
ordered, when there is an intrinsic order among the levels of the factor (as in the 
example of good, fair, poor in Table 2-1). 
Table 2-1. Types of variables encountered in typical data visualization scenarios.

| Type of variable | Examples | Appropriate scale | Description |
|---|---|---|---|
| Quantitative/numerical continuous | 1.3, 57, 83, 1.5 × 10^-2 | Continuous | Arbitrary numerical values. These can be integers, rational numbers, or real numbers. |
| Quantitative/numerical discrete | 1, 2, 3, 4 | Discrete | Numbers in discrete units. These are most commonly but not necessarily integers. For example, the numbers 0.5, 1.0, 1.5 could also be treated as discrete if intermediate values cannot exist in the given dataset. |
| Qualitative/categorical unordered | dog, cat, fish | Discrete | Categories without order. These are discrete and unique categories that have no inherent order. These variables are also called factors. |
| Qualitative/categorical ordered | good, fair, poor | Discrete | Categories with order. These are discrete and unique categories with an order. For example, "fair" always lies between "good" and "poor." These variables are also called ordered factors. |
| Date or time | Jan. 5 2018, 8:03am | Continuous or discrete | Specific days and/or times. Also generic dates, such as July 4 or Dec. 25 (without year). |
| Text | The quick brown fox jumps over the lazy dog. | None, or discrete | Free-form text. Can be treated as categorical if needed. |


To examine a concrete example of these various types of data, take a look at Table 2-2. 
It shows the first few rows of a dataset providing the daily temperature normals (average
daily temperatures over a 30-year window) for four US locations. This table contains
five variables: month, day, location, station ID, and temperature (in degrees
Fahrenheit). Month is an ordered factor, day is a discrete numerical value, location is 
an unordered factor, station ID is similarly an unordered factor, and temperature is a 
continuous numerical value. 

Table 2-2. First 8 rows of a dataset listing daily temperature normals for four weather
stations. Data source: National Oceanic and Atmospheric Administration (NOAA).

| Month | Day | Location | Station ID | Temperature (°F) |
|---|---|---|---|---|
| Jan | 1 | Chicago | USW00014819 | 25.6 |
| Jan | 1 | San Diego | USW00093107 | 55.2 |
| Jan | 1 | Houston | USW00012918 | 53.9 |
| Jan | 1 | Death Valley | USC00042319 | 51.0 |
| Jan | 2 | Chicago | USW00014819 | 25.5 |
| Jan | 2 | San Diego | USW00093107 | 55.3 |
| Jan | 2 | Houston | USW00012918 | 53.8 |
| Jan | 2 | Death Valley | USC00042319 | 51.2 |
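One way to make the variable types of Table 2-2 explicit is to write the rows down as typed records. The sketch below uses Python rather than the R code the figures were made with. The class name `Normal`, the field names, and the helper `month_rank` are assumptions introduced for illustration.

```python
from dataclasses import dataclass

# The levels of the ordered factor "month", in their intrinsic order.
MONTH_LEVELS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


@dataclass(frozen=True)
class Normal:
    month: str          # ordered factor
    day: int            # discrete numerical value
    location: str       # unordered factor
    station_id: str     # unordered factor
    temperature: float  # continuous numerical value, in °F


rows = [
    Normal("Jan", 1, "Chicago", "USW00014819", 25.6),
    Normal("Jan", 1, "San Diego", "USW00093107", 55.2),
    Normal("Jan", 1, "Houston", "USW00012918", 53.9),
    Normal("Jan", 1, "Death Valley", "USC00042319", 51.0),
    Normal("Jan", 2, "Chicago", "USW00014819", 25.5),
    Normal("Jan", 2, "San Diego", "USW00093107", 55.3),
    Normal("Jan", 2, "Houston", "USW00012918", 53.8),
    Normal("Jan", 2, "Death Valley", "USC00042319", 51.2),
]


def month_rank(month):
    """Position of a month within the ordered factor (Jan is 0)."""
    return MONTH_LEVELS.index(month)


# Unordered factors only support questions about which levels occur.
locations = sorted({row.location for row in rows})
```

The ordered factor carries an explicit list of levels, so comparisons such as "Jan comes before Feb" are meaningful, while the unordered factors do not support such comparisons.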
Scales Map Data Values onto Aesthetics

To map data values onto aesthetics, we need to specify which data values correspond 
to which specific aesthetics values. For example, if our graphic has an x axis, then we 
need to specify which data values fall onto particular positions along this axis. Similarly,
we may need to specify which data values are represented by particular shapes
or colors. This mapping between data values and aesthetics values is created via 
scales. A scale defines a unique mapping between data and aesthetics (Figure 2-2). 
Importantly, a scale must be one-to-one, such that for each specific data value there is 
exactly one aesthetics value and vice versa. If a scale isn’t one-to-one, then the data 
visualization becomes ambiguous. 
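The one-to-one requirement is easy to check for a discrete scale. The following Python sketch treats a scale as a lookup table. The shape names and the helper names `is_one_to_one` and `invert` are hypothetical and are not taken from Figure 2-2.

```python
def is_one_to_one(scale):
    """True if no two data values share the same aesthetics value."""
    targets = list(scale.values())
    return len(set(targets)) == len(targets)


def invert(scale):
    """Read the scale backward, from aesthetics value to data value."""
    if not is_one_to_one(scale):
        raise ValueError("scale is ambiguous and cannot be read backward")
    return {aesthetic: value for value, aesthetic in scale.items()}


# Hypothetical shape scale for the numbers 1 through 4.
shape_scale = {1: "square", 2: "circle", 3: "triangle", 4: "diamond"}

# A broken scale: a viewer who sees a circle cannot tell 2 from 3.
ambiguous_scale = {1: "square", 2: "circle", 3: "circle", 4: "diamond"}

legend = invert(shape_scale)
```

Inverting the scale is what a reader does when consulting a legend, which is why an ambiguous scale makes the figure unreadable.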
Figure 2-2. Scales link data values to aesthetics. Here, the numbers 1 through 4 have
been mapped onto a position scale, a shape scale, and a color scale. For each scale, each 
number corresponds to a unique position, shape, or color, and vice versa. 

Let’s put things into practice. We can take the dataset shown in Table 2-2, map temperature
onto the y axis, day of the year onto the x axis, and location onto color, and
visualize these aesthetics with solid lines. The result is a standard line plot showing 
the temperature normals at the four locations as they change during the year 
(Figure 2-3). 
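The mapping just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for Figure 2-3: the helper `linear_scale`, the panel size of 600 by 300 units, the temperature range of 0 to 100 °F, and the color codes are all made up. Only the four data rows come from Table 2-2.

```python
def linear_scale(domain, out_range):
    """Build a continuous scale that maps `domain` linearly onto `out_range`."""
    d0, d1 = domain
    r0, r1 = out_range

    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    return scale


# Position scales (positions measured from the lower left corner of the panel).
x_scale = linear_scale((1, 365), (0.0, 600.0))      # day of the year -> x
y_scale = linear_scale((0.0, 100.0), (0.0, 300.0))  # temperature (°F) -> y

# Color scale for the unordered factor "location" (hypothetical colors).
color_scale = {
    "Chicago": "#1b9e77",
    "San Diego": "#d95f02",
    "Houston": "#7570b3",
    "Death Valley": "#e7298a",
}

# (day of the year, location, temperature) for four rows of Table 2-2.
rows = [
    (1, "Chicago", 25.6),
    (1, "Death Valley", 51.0),
    (2, "Chicago", 25.5),
    (2, "Death Valley", 51.2),
]

# Each data row becomes one point described purely by aesthetics.
points = [(x_scale(day), y_scale(temp), color_scale[loc])
          for day, loc, temp in rows]
```

Connecting the points that share a color, in order of their x position, would give one line per location.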

Figure 2-3 is a fairly standard visualization for a temperature curve and likely the visualization
most data scientists would intuitively choose first. However, it is up to us
which variables we map onto which scales. For example, instead of mapping temperature
onto the y axis and location onto color, we can do the opposite. Because now the
key variable of interest (temperature) is shown as color, we need to show sufficiently
large colored areas for the colors to convey useful information [Stone, Albers Szafir,
and Setlur 2014]. Therefore, for this visualization I have chosen squares instead of
lines, one for each month and location, and I have colored them by the average temperature
normal for each month (Figure 2-4).
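Coloring one square per month and location requires collapsing the daily values into one value per square. The Python sketch below shows that aggregation step on the eight rows of Table 2-2. Because those rows cover only January 1 and 2, the result is a two-day average and not a true monthly normal. The variable names are assumptions.

```python
from collections import defaultdict
from statistics import mean

# (month, location, temperature in °F), taken from Table 2-2.
daily = [
    ("Jan", "Chicago", 25.6), ("Jan", "San Diego", 55.2),
    ("Jan", "Houston", 53.9), ("Jan", "Death Valley", 51.0),
    ("Jan", "Chicago", 25.5), ("Jan", "San Diego", 55.3),
    ("Jan", "Houston", 53.8), ("Jan", "Death Valley", 51.2),
]

# Group the daily values by the pair that identifies one square.
groups = defaultdict(list)
for month, location, temperature in daily:
    groups[(month, location)].append(temperature)

# One averaged value per (month, location), ready for a color scale.
tile_values = {key: mean(temps) for key, temps in groups.items()}
```

Each key of `tile_values` corresponds to one square in a plot such as Figure 2-4, and each value is the number that the color scale would turn into a color.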
Figure 2-3. Daily temperature normals for four selected locations in the US. Temperature
is mapped to the y axis, day of the year to the x axis, and location to line color. Data 
source: NOAA. 


[Figure 2-4: tile plot with month (Jan through Dec) on the x axis, location (Chicago, San Diego, Houston, Death Valley) on the y axis, and tile color showing temperature.]

Figure 2-4. Monthly normal mean temperatures for four locations in the US. Data 
source: NOAA. 

I would like to emphasize that Figure 2-4 uses two position scales (month along the x 
axis and location along the y axis), but neither is a continuous scale. Month is an 
ordered factor with 12 levels and location is an unordered factor with 4 levels. Therefore,
the two position scales are both discrete. For discrete position scales, we generally
place the different levels of the factor at an equal spacing along the axis. If the
factor is ordered (as is the case here for month), then the levels need to be placed in
the appropriate order. If the factor is unordered (as is the case here for location), then
the order is arbitrary, and we can choose any order we want. I have ordered the locations
from overall coldest (Chicago) to overall hottest (Death Valley) to generate a
pleasant staggering of colors. However, I could have chosen any other order and the
figure would have been equally valid.
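A discrete position scale of this kind amounts to numbering the levels. The following Python sketch uses a hypothetical helper, `discrete_position_scale`, with an assumed spacing of one unit between levels. The location order is the one stated in the text.

```python
def discrete_position_scale(levels, start=0.0, step=1.0):
    """Place the levels of a factor at equal spacing, in the order given."""
    return {level: start + i * step for i, level in enumerate(levels)}


# Ordered factor: the order of the levels is fixed by the data.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
x_scale = discrete_position_scale(months)

# Unordered factor: this order (coldest to hottest) is a design choice.
y_scale = discrete_position_scale(
    ["Chicago", "San Diego", "Houston", "Death Valley"])

# Any other order gives an equally valid scale with the same positions in use.
alphabetical = discrete_position_scale(
    ["Chicago", "Death Valley", "Houston", "San Diego"])
```

Both location scales assign each location its own position, so both are valid. They differ only in which location ends up where.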

Both Figures 2-3 and 2-4 used three scales in total, two position scales and one color 
scale. This is a typical number of scales for a basic visualization, but we can use more 
than three scales at once. Figure 2-5 uses five scales (two position scales, one color
scale, one size scale, and one shape scale), and each scale represents a different variable
from the dataset.
Figure 2-5. Fuel efficiency versus displacement, for 32 cars (1973-74 models). This figure
uses five separate scales to represent data: (i) the x axis (displacement); (ii) the y axis
(fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points
(weight); and (v) the shape of the data points (number of cylinders). Four of the five
variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous.
The remaining one (number of cylinders) can be considered to be either numerical
discrete or qualitative ordered. Data source: Motor Trend, 1974.
CHAPTER 3


Coordinate Systems and Axes 


To make any sort of data visualization, we need to define position scales, which determine
where in a graphic different data values are located. We cannot visualize data
without placing different data points at different locations, even if we just arrange 
them next to each other along a line. For regular 2D visualizations, two numbers are 
required to uniquely specify a point, and therefore we need two position scales. These 
two scales are usually but not necessarily the x and y axes of the plot. We also have to 
specify the relative geometric arrangement of these scales. Conventionally, the x axis 
runs horizontally and the y axis vertically, but we could choose other arrangements. 
For example, we could have the y axis run at an acute angle relative to the x axis, or 
we could have one axis run in a circle and the other run radially. The combination of 
a set of position scales and their relative geometric arrangement is called a coordinate 
system. 

Cartesian Coordinates 

The most widely used coordinate system for data visualization is the 2D Cartesian 
coordinate system, where each location is uniquely specified by an x and a y value. The 
x and y axes run orthogonally to each other, and data values are placed in an even 
spacing along both axes (Figure 3-1). The two axes are continuous position scales, 
and they can represent both positive and negative real numbers. To fully specify the 
coordinate system, we need to specify the range of numbers each axis covers. In 
Figure 3-1, the x axis runs from -2.2 to 3.2 and the y axis runs from -2.2 to 2.2. Any 
data values between these axis limits are placed at the appropriate respective location 
in the plot. Any data values outside the axis limits are discarded. 
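The role of the axis limits can be sketched in Python as follows. The limits are the ones quoted for Figure 3-1. The helper names `make_axis` and `place` and the panel size of 540 by 440 units (100 units per data unit on both axes) are assumptions.

```python
def make_axis(lower, upper, length):
    """Continuous position scale covering the data range [lower, upper]."""

    def position(value):
        if value < lower or value > upper:
            return None  # outside the axis limits: the value is discarded
        return (value - lower) / (upper - lower) * length

    return position


x_axis = make_axis(-2.2, 3.2, 540.0)
y_axis = make_axis(-2.2, 2.2, 440.0)


def place(point):
    """Return the panel position of a data point, or None if it is discarded."""
    x, y = x_axis(point[0]), y_axis(point[1])
    if x is None or y is None:
        return None
    return (x, y)


origin = place((0, 0))
inside = place((2, 1))
outside = place((4, 0))  # x = 4 exceeds the upper x limit of 3.2
```

A point is drawn only if both of its coordinates fall within the limits. Otherwise `place` returns `None` and nothing appears in the plot.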



Figure 3-1. Standard Cartesian coordinate system. The horizontal axis is conventionally 
called x and the vertical axis y. The two axes form a grid with equidistant spacing. Here, 
both the x and the y grid lines are separated by units of one. The point (2, 1) is located 
two x units to the right and one y unit above the origin (0, 0). The point (-1, -1) is located
one x unit to the left and one y unit below the origin.


Data values usually aren’t just numbers, however. They come with units. For example, 
if we’re measuring temperature, the values may be measured in degrees Celsius or 
Fahrenheit. Similarly, if we’re measuring distance, the values may be measured in 
kilometers or miles, and if we’re measuring duration, the values may be measured in 
minutes, hours, or days. In a Cartesian coordinate system, the spacing between grid 
lines along an axis corresponds to discrete steps in these data units. In a temperature 
scale, for example, we may have a grid line every 10 degrees Fahrenheit, and in a distance
scale, we may have a grid line every 5 kilometers.

A Cartesian coordinate system can have two axes representing two different units. 
This situation arises quite commonly whenever we’re mapping two different types of 
variables to x and y. For example, in Figure 2-3, we plotted temperature versus days of 
the year. The y axis of Figure 2-3 is measured in degrees Fahrenheit, with a grid line
every 20 degrees, and the x axis is measured in months, with a grid line at the first
of every third month. Whenever the two axes are measured in different units, we can 
stretch or compress one relative to the other and maintain a valid visualization of the 
data (Figure 3-2). Which version is preferable may depend on the story we want to
convey. A tall and narrow figure emphasizes change along the y axis and a short and
wide figure does the opposite. Ideally, we want to choose an aspect ratio that ensures 
that any important differences in position are noticeable. 



Figure 3-2. Daily temperature normals for Houston, TX. Temperature is mapped to the 
y axis and day of the year to the x axis. Parts (a), (b), and (c) show the same figure in 
different aspect ratios. All three parts are valid visualizations of the temperature data. 
Data source: NOAA. 


On the other hand, if the x and y axes are measured in the same units, then the grid 
spacings for the two axes should be equal, such that the same distance along the x or y 
axis corresponds to the same number of data units. As an example, we can plot the 
temperature in Houston, TX, against the temperature in San Diego, CA, for every day 
of the year (Figure 3-3a). Since the same quantity is plotted along both axes, we need 
to make sure that the grid lines form perfect squares, as is the case in Figure 3-3a. 

Figure 3-3. Daily temperature normals for Houston, TX, plotted versus the respective
temperature normals of San Diego, CA. The first days of the months January, April, July,
and October are highlighted to provide a temporal reference. (a) Temperatures are
shown in degrees Fahrenheit. (b) Temperatures are shown in degrees Celsius. Data
source: NOAA.


You may wonder what happens if you change the units of your data. After all, units 
are arbitrary, and your preferences might be different from somebody else’s. A change 
in units is a linear transformation, where we add or subtract a number to or from all 
data values and/or multiply all data values with another number. Fortunately, Cartesian
coordinate systems are invariant under such linear transformations. Therefore,
you can change the units of your data and the resulting figure will not change as long 
as you change the axes accordingly. As an example, compare Figures 3-3a and 3-3b. 
Both show the same data, but in part (a) the temperature units are degrees Fahrenheit 
and in part (b) they are degrees Celsius. Even though the grid lines are in different 
locations and the numbers along the axes are different, the two data visualizations 
look exactly the same. 
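
The invariance under a change of units can be checked numerically. The following is a minimal sketch, assuming made-up temperatures and axis limits (they are not taken from the NOAA data); the helper names are hypothetical. It converts both the data and the axis limits from Fahrenheit to Celsius and compares where each point lands along the axis.

```python
def relative_position(value, axis_min, axis_max):
    """Map a data value to a relative position on a linear axis (0 = start, 1 = end)."""
    return (value - axis_min) / (axis_max - axis_min)


def fahrenheit_to_celsius(temp_f):
    """A change of units is a linear transformation: subtract 32, then multiply by 5/9."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical temperatures and axis limits in degrees Fahrenheit (illustration only).
temps_f = [50.0, 68.0, 95.0]
limits_f = (40.0, 100.0)

# Change the units of the data and change the axis limits accordingly.
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]
limits_c = tuple(fahrenheit_to_celsius(v) for v in limits_f)

positions_f = [relative_position(t, *limits_f) for t in temps_f]
positions_c = [relative_position(t, *limits_c) for t in temps_c]

# Each point lands at the same relative position in both unit systems.
for pos_f, pos_c in zip(positions_f, positions_c):
    print(f"{pos_f:.4f}  {pos_c:.4f}")
```

The additive and multiplicative parts of the conversion cancel in the ratio, which is why only the tick labels change and not the placement of the points.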

Nonlinear Axes 

In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both 
in data units and in the resulting visualization. We refer to the position scales in these 
coordinate systems as linear. While linear scales generally provide an accurate representation
of the data, there are scenarios where nonlinear scales are preferred. In a
nonlinear scale, even spacing in data units corresponds to uneven spacing in the
visualization, or conversely even spacing in the visualization corresponds to uneven
spacing in data units.

The most commonly used nonlinear scale is the logarithmic scale, or log scale for 
short. Log scales are linear in multiplication, such that a unit step on the scale corresponds
to multiplication with a fixed value. To create a log scale, we need to log-transform
the data values while exponentiating the numbers that are shown along the
axis grid lines. This process is demonstrated in Figure 3-4, which shows the numbers
1, 3.16, 10, 31.6, and 100 placed on linear and log scales. The numbers 3.16 and 31.6
may seem like strange choices, but they were selected because they are exactly halfway
between 1 and 10 and between 10 and 100 on a log scale. We can see this by
observing that $10^{0.5} = \sqrt{10} \approx 3.16$, and equivalently $3.16 \times 3.16 \approx 10$. Similarly,
$10^{1.5} = 10 \times 10^{0.5} \approx 31.6$.
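
The even spacing of these five numbers on a log scale can be verified directly. This is a small sketch using only Python's standard `math` module; the variable names are my own.

```python
import math

values = [1, 3.16, 10, 31.6, 100]

# Log-transform the data values: these are the positions along a log scale.
log_positions = [math.log10(v) for v in values]

# Distance between neighboring points on the log scale.
steps = [b - a for a, b in zip(log_positions, log_positions[1:])]

for value, position in zip(values, log_positions):
    print(f"value = {value:>6}  position = {position:.3f}")

# Each step is close to 0.5, i.e., multiplication by the square root of 10.
print([round(step, 3) for step in steps])
print(round(math.sqrt(10), 4))
```

The steps are only approximately 0.5 because 3.16 and 31.6 are rounded versions of the exact halfway values.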

Figure 3-4. Relationship between linear and logarithmic scales. The dots correspond to
the data values 1, 3.16, 10, 31.6, and 100, which are evenly spaced numbers on a logarithmic
scale. We can display these data points on a linear scale, we can log-transform
them and then show them on a linear scale, or we can show them on a logarithmic scale. 
Importantly, the correct axis title for a logarithmic scale is the name of the variable 
shown, not the logarithm of that variable. 

Mathematically, there is no difference between plotting the log-transformed data on a 
linear scale or plotting the original data on a logarithmic scale (Figure 3-4). The only 
difference lies in the labeling for the individual axis ticks and for the axis as a whole. 
In most cases, the labeling for a logarithmic scale is preferable, because it places less
mental burden on the reader to interpret the numbers shown as the axis tick labels.
There is also less of a risk of confusion about the base of the logarithm. When working
with log-transformed data, we can get confused about whether the data was transformed
using the natural logarithm or the logarithm to base 10. And it’s not
uncommon for labeling to be ambiguous—e.g., log(x), which doesn’t specify a base at 
all. I recommend that you always verify the base when working with log-transformed 
data. When plotting log-transformed data, always specify the base in the labeling of 
the axis. 

Because multiplication on a log scale looks like addition on a linear scale, log scales 
are the natural choice for any data that has been obtained by multiplication or division.
In particular, ratios should generally be shown on a log scale. As an example, I
have taken the number of inhabitants in each county in Texas and divided it by the
median number of inhabitants across all Texas counties. The resulting ratio is a number
that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding
county has the median number of inhabitants. When visualizing these
ratios on a log scale, we can see that the population numbers in Texas counties are
symmetrically distributed around the median, and that the most populous counties
have over 100 times more inhabitants than the median while the least populous counties
have over 100 times fewer inhabitants (Figure 3-5).
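
To see why ratios look symmetric on a log scale, consider the following sketch. The population numbers are invented for illustration (they are not the census data), chosen so that the smallest and largest values are 100 times below and above the median.

```python
import math
import statistics

# Hypothetical county populations (assumed for illustration, not the census data).
populations = [180, 1_800, 18_000, 180_000, 1_800_000]

median_population = statistics.median(populations)
ratios = [p / median_population for p in populations]

# Position on a log scale: equal distances correspond to equal multiplication factors.
log_positions = [math.log10(r) for r in ratios]

for population, ratio, position in zip(populations, ratios, log_positions):
    print(f"{population:>9,d}  ratio = {ratio:<8g} log10(ratio) = {position:+.1f}")
```

On the log scale, a county with 100 times fewer inhabitants than the median sits exactly as far from the median as a county with 100 times more. On a linear scale, the same two ratios (0.01 and 100) would lie about 1 unit and 99 units from the median, respectively.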

Figure 3-5. Population numbers of Texas counties relative to their median value. Select
counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to
a county with median population number. The most populous counties have approximately
100 times more inhabitants than the median county, and the least populous
counties have approximately 100 times fewer inhabitants than the median county. Data 
source: 2010 US Decennial Census. 

By contrast, for the same data, a linear scale obscures the differences between a 
county with median population number and a county with a much smaller population
number than median (Figure 3-6).

Figure 3-6. Population sizes of Texas counties relative to their median value. By displaying
a ratio on a linear scale, we have overemphasized ratios > 1 and have obscured
ratios < 1. As a general rule, ratios should not be displayed on a linear scale. Data 
source: 2010 US Decennial Census. 


On a log scale, the value 1 is the natural midpoint, similar to the value 0 on a linear 
scale. We can think of values greater than 1 as representing multiplications and values 
less than 1 divisions. For example, we can write $10 = 1 \times 10$ and $0.1 = 1/10$. The value
0, on the other hand, can never appear on a log scale. It lies infinitely far from 1. One
way to see this is to consider that $\log(0) = -\infty$. Or, alternatively, consider that to go
from 1 to 0, it takes either an infinite number of divisions by a finite value (e.g.,
$1/10/10/10/10/10/10 \cdots = 0$) or one division by infinity (i.e., $1/\infty = 0$).
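
Stated a little more carefully, both observations are limits. The notation below is my own shorthand for the argument in the text, with $n$ counting the number of divisions by 10:

```latex
\[
  \lim_{n \to \infty} \frac{1}{10^{n}} = 0,
  \qquad
  \log_{10}\!\left(\frac{1}{10^{n}}\right) = -n
  \;\longrightarrow\; -\infty
  \quad \text{as } n \to \infty .
\]
```

Each division by 10 moves a point one unit step to the left on a base-10 log scale, so reaching 0 would require infinitely many steps.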

Log scales are frequently used when the dataset contains numbers of very different 
magnitudes. For the Texas counties shown in Figures 3-5 and 3-6, the most populous 
one (Harris) had 4,092,459 inhabitants in the 2010 US Census while the least populous
one (Loving) had 82. So, a log scale would be appropriate even if we hadn’t divided
the population numbers by their median to turn them into ratios. But what
would we do if there was a county with 0 inhabitants? This county could not be
shown on the logarithmic scale, because it would lie at minus infinity. In this situation,
the recommendation is sometimes to use a square-root scale, which uses a
square-root transformation instead of a log transformation (Figure 3-7). Just like a 
log scale, a square-root scale compresses larger numbers into a smaller range, but 
unlike a log scale, it allows for the presence of 0. 

Figure 3-7. Relationship between linear and square-root scales. The dots correspond to
the data values 0, 1, 4, 9, 16, 25, 36, and 49, which are evenly spaced numbers on a 
square-root scale, since they are the squares of the integers from 0 to 7. We can display 
these data points on a linear scale, we can square-root-transform them and then show 
them on a linear scale, or we can show them on a square-root scale. 

I see two problems with square-root scales. First, while on a linear scale one unit step 
corresponds to addition or subtraction of a constant value, and on a log scale it corresponds
to multiplication with or division by a constant value, no such rule exists for a
square-root scale. The meaning of a unit step on a square-root scale depends on the 
scale value at which we’re starting. Second, it is unclear how to best place axis ticks on 
a square-root scale. To obtain evenly spaced ticks, we would have to place them at 
squares, but axis ticks at, for example, positions 0, 4, 16, 36, and 64 (every second
square) would be unintuitive. Alternatively, we could place them at linear intervals 
(10, 20, 30, etc.), but this would result in either too few axis ticks near the low end of 
the scale or too many near the high end. In Figure 3-7, I have placed the axis ticks at
positions 0, 1, 5, 10, 20, 30, 40, and 50 on the square-root scale. These values are arbitrary
but provide a reasonable covering of the data range.
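
The first problem can be made concrete with a short sketch. It uses the data values 0, 1, 4, ..., 49 from Figure 3-7 and Python's standard `math.sqrt`; the variable names are my own.

```python
import math

# The squares of the integers 0 to 7, as in Figure 3-7.
data_values = [n * n for n in range(8)]

# Position on a square-root scale: evenly spaced for perfect squares.
positions = [math.sqrt(v) for v in data_values]

# How much the data value changes for each unit step along the scale.
data_steps = [b - a for a, b in zip(data_values, data_values[1:])]

print(positions)   # evenly spaced: 0.0, 1.0, 2.0, ...
print(data_steps)  # not constant: 1, 3, 5, ...
```

A unit step starting at position p covers a data difference of 2p + 1, so the step is neither a fixed amount added nor a fixed factor multiplied. Unlike the log scale, however, the value 0 has a well-defined position.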

Despite these problems with square-root scales, they are valid position scales and I do 
not discount the possibility that they have appropriate applications. For example, just 
like a log scale is the natural scale for ratios, one could argue that the square-root 
scale is the natural scale for data that comes in squares. One scenario in which data is 
naturally squares is in the context of geographic regions. If we show the areas of geographic
regions on a square-root scale, we are highlighting the regions’ linear extent
from east to west or north to south. These extents could be relevant, for example, if 
we were wondering how long it might take to drive across a region. Figure 3-8 shows 
the areas of states in the US Northeast on both a linear and a square-root scale. Even 
though the areas of these states are quite different (Figure 3-8a), the relative time it
will take to drive across each state is more accurately represented by the figure on the 
square-root scale (Figure 3-8b) than the figure on the linear scale (Figure 3-8a). 

Figure 3-8. Areas of northeastern US states. (a) Areas shown on a linear scale. (b) Areas
shown on a square-root scale. Data source: Google.

Coordinate Systems with Curved Axes 

All the coordinate systems we have encountered so far have used two straight axes 
positioned at a right angle to each other, even if the axes themselves established a 
nonlinear mapping from data values to positions. There are other coordinate systems, 
however, where the axes themselves are curved. In particular, in the polar coordinate 
system, we specify positions via an angle and a radial distance from the origin, and 
therefore the angle axis is circular (Figure 3-9). 

Polar coordinates can be useful for data of a periodic nature, such that data values at 
one end of the scale can be logically joined to data values at the other end. For example,
consider the days in a year. December 31st is the last day of the year, but it is also
one day before the first day of the year. If we want to show how some quantity varies 
over the year, it can be appropriate to use polar coordinates with the angle coordinate 
specifying each day. Let’s apply this concept to the temperature normals of Figure 2-3. 
Because temperature normals are average temperatures that are not tied to any specific
year, Dec. 31st can be thought of as 365 days later than Jan. 1st (temperature normals
include Feb. 29th, so the cycle has 366 days) and also 1 day earlier.
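
The wraparound can be sketched in a few lines. The code below assumes a 366-day cycle (because the normals include Feb. 29th) and places Jan. 1st at angle 0; that orientation is an arbitrary choice of mine, and the function names are hypothetical.

```python
import math

DAYS_IN_CYCLE = 366  # temperature normals include Feb. 29th


def day_to_angle(day_of_year):
    """Map day 1 (Jan. 1st) through day 366 (Dec. 31st) to an angle in radians."""
    return 2.0 * math.pi * (day_of_year - 1) / DAYS_IN_CYCLE


def polar_to_xy(angle, radius):
    """Convert an (angle, radius) pair to Cartesian drawing coordinates."""
    return radius * math.cos(angle), radius * math.sin(angle)


def angular_gap(angle_a, angle_b):
    """Smallest angle between two directions on the circle."""
    diff = abs(angle_a - angle_b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


jan_1 = day_to_angle(1)
dec_31 = day_to_angle(366)
one_day = 2.0 * math.pi / DAYS_IN_CYCLE

# On the circular axis, Dec. 31st is a single day's angle away from Jan. 1st.
print(angular_gap(jan_1, dec_31) / one_day)

# The radius would carry the temperature; 60 is a made-up value for illustration.
print(polar_to_xy(dec_31, 60.0))
```

On a Cartesian x axis the same two days would sit at opposite ends of the plot.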
By plotting the temperature normals in a polar coordinate system, we emphasize this
cyclical property they have (Figure 3-10). In comparison to Figure 2-3, the polar version
highlights how similar the temperatures are in Death Valley, Houston, and San
Diego from late fall to early spring. In the Cartesian coordinate system, this fact is 
obscured because the temperature values in late December and in early January are 
shown in opposite parts of the figure and therefore don’t form a single visual unit. 

Figure 3-9. Relationship between Cartesian and polar coordinates. (a) Three data points
shown in a Cartesian coordinate system. (b) The same three data points shown in a
polar coordinate system. We have taken the x coordinates from part (a) and used them 
as angular coordinates and the y coordinates from part (a) and used them as radial 
coordinates. The circular axis runs from 0 to 4 in this example, and therefore x = 0 and 
x = 4 are the same locations in this coordinate system. 

Figure 3-10. Daily temperature normals for four selected locations in the US, shown in 
polar coordinates. The radial distance from the center point indicates the daily temperature
in Fahrenheit, and the days of the year are arranged counterclockwise starting with
Jan. 1st at the 6:00 position. Data source: NOAA.


A second setting in which we encounter curved axes is in the context of geospatial 
data, i.e., maps. Locations on the globe are specified by their longitude and latitude. 
But because the earth is a sphere, drawing latitude and longitude as Cartesian axes is 
misleading and not recommended (Figure 3-11). Instead, we use various types of 
nonlinear projections that attempt to minimize artifacts and that strike different balances
between conserving areas or angles relative to the true shape lines on the globe
(Figure 3-11). 

Figure 3-11. Map of the world, shown in four different projections. The Cartesian longitude
and latitude system maps the longitude and latitude of each location onto a regular
Cartesian coordinate system. This mapping causes substantial distortions in both areas
and angles relative to their true values on the 3D globe. The interrupted Goode homolosine
projection perfectly represents true surface areas, at the cost of dividing some land
masses into separate pieces, most notably Greenland and Antarctica. The Robinson projection
and the Winkel tripel projection both strike a balance between angular and area
distortions, and they are commonly used for maps of the entire globe.

CHAPTER 4


Color Scales 


There are three fundamental use cases for color in data visualizations: we can use 
color to distinguish groups of data from each other, to represent data values, and to 
highlight. The types of colors we use and the way in which we use them are quite different
for these three cases.

Color as a Tool to Distinguish 

We frequently use color as a means to distinguish discrete items or groups that do not 
have an intrinsic order, such as different countries on a map or different manufacturers
of a certain product. In this case, we use a qualitative color scale. Such a scale contains
a finite set of specific colors that are chosen to look clearly distinct from each
other while also being equivalent to each other. The second condition requires that 
no one color should stand out relative to the others. Also, the colors should not create 
the impression of an order, as would be the case with a sequence of colors that get 
successively lighter. Such colors would create an apparent order among the items 
being colored, which by definition have no order. 

Many appropriate qualitative color scales are readily available. Figure 4-1 shows three 
representative examples. In particular, the ColorBrewer project provides a nice selection
of qualitative color scales, including both fairly light and fairly dark colors
[Brewer 2017]. 
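
In code, a qualitative scale is simply a fixed list of colors handed out to categories. The sketch below uses the hex codes of the Okabe Ito palette as they are commonly published; the helper function and the category names are hypothetical and only serve as an illustration.

```python
# Okabe Ito palette as commonly published: orange, sky blue, bluish green,
# yellow, blue, vermilion, reddish purple, black.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def assign_colors(categories, palette):
    """Give each distinct category its own color; refuse to reuse colors."""
    unique = sorted(set(categories))
    if len(unique) > len(palette):
        raise ValueError("more categories than distinct colors in the palette")
    return dict(zip(unique, palette))


# Hypothetical unordered categories (illustration only).
regions = ["West", "South", "Midwest", "Northeast", "South", "West"]
region_colors = assign_colors(regions, OKABE_ITO)

for region, color in region_colors.items():
    print(region, color)
```

The function raises an error instead of recycling colors, because two categories that share a color can no longer be told apart. The order in which colors are handed out carries no meaning, in keeping with the idea that the categories have no intrinsic order.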



Figure 4-1. Example qualitative color scales. The Okabe Ito scale is the default scale used 
throughout this book [Okabe and Ito 2008]. The ColorBrewer Dark2 scale is provided by
the ColorBrewer project [Brewer 2017]. The ggplot2 hue scale is the default qualitative
scale in the widely used plotting software ggplot2. 

As an example of how we use qualitative color scales, consider Figure 4-2. It shows 
the percent population growth from 2000 to 2010 in US states. I have arranged the 
states in order of their population growth, and I have colored them by geographic 
region. This coloring highlights that states in the same regions have experienced similar
population growth. In particular, states in the West and the South have seen the
largest population increases, whereas states in the Midwest and the Northeast have 
grown much less. 

Figure 4-2. Population growth in the US from 2000 to 2010. States in the West and
South have seen the largest increases, whereas states in the Midwest and Northeast have 
seen much smaller increases (or even, in the case of Michigan, a decrease). Data source: 
US Census Bureau. 


Color to Represent Data Values 

Color can also be used to represent quantitative data values, such as income, temperature,
or speed. In this case, we use a sequential color scale. Such a scale contains a
sequence of colors that clearly indicate which values are larger or smaller than which
other ones, and how distant two specific values are from each other. The second point
implies that the color scale needs to be perceived to vary uniformly across its entire
range. 

Sequential scales can be based on a single hue (e.g., from dark blue to light blue) or 
on multiple hues (e.g., from dark red to light yellow) (Figure 4-3). Multihue scales 
tend to follow color gradients that can be seen in the natural world, such as dark red, 
green, or blue to light yellow, or dark purple to light green. The reverse (e.g., dark 
yellow to light blue) looks unnatural and doesn’t make a useful sequential scale. 
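
A single-hue sequential scale can be sketched as an interpolation between two end colors. The example below is only a rough illustration: the two hex codes are my own choice of a light and a dark blue, and straight-line interpolation in RGB is a simplification that does not guarantee perceptually uniform steps. Carefully designed scales are typically constructed in a perceptual color space instead.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to a tuple of three integers in the range 0-255."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert a tuple of three integers in the range 0-255 to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def sequential_color(value, vmin, vmax, low_hex, high_hex):
    """Map a data value to a color between low_hex (at vmin) and high_hex (at vmax)."""
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the data range
    low, high = hex_to_rgb(low_hex), hex_to_rgb(high_hex)
    return rgb_to_hex(tuple(round(a + t * (b - a)) for a, b in zip(low, high)))


# Assumed end colors: a light blue and a dark blue (illustration only).
LIGHT_BLUE, DARK_BLUE = "#deebf7", "#08306b"

for value in (0, 25, 50, 75, 100):
    print(value, sequential_color(value, 0, 100, LIGHT_BLUE, DARK_BLUE))
```

Because the mapping is monotonic in every color channel, larger values always get darker colors, which is the property that lets a reader judge which of two values is larger.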





Figure 4-3. Example sequential color scales. The ColorBrewer Blues scale is a monochromatic
scale that varies from dark to light blue. The Heat and Viridis scales are multihue
scales that vary from dark red to light yellow and from dark blue via green to light yellow,
respectively.

Representing data values as colors is particularly useful when we want to show how 
the data values vary across geographic regions. In this case, we can draw a map of the 
geographic regions and color them by the data values. Such maps are called choropleths.
Figure 4-4 shows an example where I have mapped annual median income
within each county in Texas onto a map of those counties. 

In some cases, we need to visualize the deviation of data values in one of two directions
relative to a neutral midpoint. One straightforward example is a dataset containing
both positive and negative numbers. We may want to show those with
different colors, so that it is immediately obvious whether a value is positive or negative
as well as how far in either direction it deviates from zero. The appropriate color
scale in this situation is a diverging color scale. We can think of a diverging scale as
two sequential scales stitched together at a common midpoint, which usually is represented
by a light color (Figure 4-5). Diverging scales need to be balanced, so that the
progression from light colors in the center to dark colors on the outside is approximately
the same in either direction. Otherwise, the perceived magnitude of a data
value would depend on whether it fell above or below the midpoint value. 
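
The "two sequential scales stitched together" idea can be sketched as follows. The three anchor colors are assumptions of mine (a dark blue, a light neutral, and a dark red), and the RGB interpolation is a simplification that only approximates a balanced progression.

```python
def lerp_rgb(color_a, color_b, t):
    """Linearly interpolate between two RGB tuples, with t between 0 and 1."""
    return tuple(round(a + t * (b - a)) for a, b in zip(color_a, color_b))


# Assumed anchor colors (illustration only).
LOW = (33, 102, 172)    # dark blue for values far below the midpoint
MID = (247, 247, 247)   # light neutral color at the midpoint
HIGH = (178, 24, 43)    # dark red for values far above the midpoint


def diverging_color(value, midpoint, max_deviation):
    """Map a value to a color based on its signed deviation from the midpoint.

    The same max_deviation is used on both sides, so that equal deviations
    above and below the midpoint move equally far along their half of the scale.
    """
    d = (value - midpoint) / max_deviation
    d = max(-1.0, min(1.0, d))  # clamp to the ends of the scale
    if d < 0:
        return lerp_rgb(MID, LOW, -d)
    return lerp_rgb(MID, HIGH, d)


for value in (-10, -5, 0, 5, 10):
    print(value, diverging_color(value, midpoint=0, max_deviation=10))
```

Using one shared `max_deviation` is the design choice that keeps the scale symmetric around the midpoint. Fitting each half to its own data range would make, say, -5 and +5 appear to differ in magnitude.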

Figure 4-4. Median annual income in Texas counties. The highest median incomes are
seen in major Texas metropolitan areas, in particular near Houston and Dallas. No 
median income estimate is available for Loving County in West Texas, and therefore that 
county is shown in gray. Data source: 2015 Five-Year American Community Survey. 





Figure 4-5. Example diverging color scales. Diverging scales can be thought of as two 
sequential scales stitched together at a common midpoint color. Common color choices 
for diverging scales include brown to greenish blue, pink to yellow-green, and blue to red. 

As an example application of a diverging color scale, consider Figure 4-6, which
shows the percentage of people identifying as white in Texas counties. Even though 
percentage is always a positive number, a diverging scale is justified here, because 
50% is a meaningful midpoint value. Numbers above 50% indicate that whites are in 
the majority and numbers below 50% indicate the opposite. The visualization clearly 
shows in which counties whites are in the majority, in which they are in the minority, 
and in which whites and nonwhites occur in approximately equal proportions. 

Figure 4-6. Percentage of people identifying as white in Texas counties. Whites are in the
majority in North and East Texas but not in South or West Texas. Data source: 2010 US 
Decennial Census. 

Color as a Tool to Highlight

Color can also be an effective tool to highlight specific elements in the data. There 
may be specific categories or values in the dataset that carry key information about 
the story we want to tell, and we can strengthen the story by emphasizing the relevant 
figure elements to the reader. An easy way to achieve this emphasis is to color these 
figure elements in a color or set of colors that vividly stand out against the rest of the 
figure. This effect can be achieved with accent color scales, which are color scales that 
contain both a set of subdued colors and a matching set of stronger, darker, and/or 
more saturated colors (Figure 4-7). 





Figure 4-7. Example accent color scales, each with four base colors and three accent colors.
Accent color scales can be derived in several different ways: (top) we can take an
existing color scale (e.g., the Okabe Ito scale, Figure 4-1) and lighten and/or partially
desaturate some colors while darkening others; (middle) we can take gray values and 
pair them with colors; (bottom) we can use an existing accent color scale (e.g., the one 
from the ColorBrewer project). 

As an example of how the same data can support differing stories with different coloring
approaches, I have created a variant of Figure 4-2 where now I highlight two
specific states, Texas and Louisiana (Figure 4-8). Both states are in the South, they are 
immediate neighbors, and yet one state (Texas) was the fifth fastest growing state 
within the US from 2000 to 2010 whereas the other was the third slowest growing. 

Figure 4-8. From 2000 to 2010, the two neighboring southern states, Texas and Louisiana,
experienced among the highest and lowest population growth across the US. Data
source: US Census Bureau. 


When working with accent colors, it is critical that the baseline colors do not compete 
for attention. Notice how drab the baseline colors are in Figure 4-8, yet they work 
well to support the accent color. It is easy to make the mistake of using baseline colors 
that are too colorful, so that they end up competing for the reader’s attention against
the accent colors. There is an easy remedy, however: just remove all color from all 
elements in the figure except the highlighted data categories or points. An example of 
this strategy is provided in Figure 4-9. 
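
The remove-all-other-color strategy amounts to a two-color assignment. In this minimal sketch, the function name, the labels, and the specific accent and gray hex codes are my own assumptions.

```python
def highlight_colors(labels, highlighted, accent="#D55E00", baseline="#B3B3B3"):
    """Return one color per label: the accent color for highlighted labels, gray otherwise."""
    return [accent if label in highlighted else baseline for label in labels]


# Hypothetical category labels (illustration only).
labels = ["A", "B", "C", "D", "E"]
colors = highlight_colors(labels, highlighted={"B", "D"})

for label, color in zip(labels, colors):
    print(label, color)
```

With a single neutral baseline, nothing else in the figure can compete with the accent color for attention.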

Figure 4-9. Track athletes are among the shortest and leanest of male professional athletes
participating in popular sports. Data source: [Telford and Cunningham 1991].

CHAPTER 5


Directory of Visualizations 


This chapter provides a quick visual overview of the various plots and charts that are 
commonly used to visualize different types of data. It is meant both to serve as a table 
of contents, in case you are looking for a particular visualization whose name you 
may not know, and as a source of inspiration, if you need to find alternatives to the 
figures you routinely make. 

Amounts 


Bars Bars Dots 



The most common approach to visualizing amounts (i.e., numerical values shown for 
some set of categories) is using bars, either vertically or horizontally arranged (Chapter 6).
However, instead of using bars, we can also place dots at the location where the
corresponding bar would end (Chapter 6). 

Grouped Bars Grouped Bars Stacked Bars Stacked Bars Heatmap



If there are two or more sets of categories for which we want to show amounts, we 
can group or stack the bars (Chapter 6). We can also map the categories onto the x 
and y axes and show amounts by color, via a heatmap (Chapter 6). 

Distributions 


Histogram Density Plot Cumulative Density Quantile-Quantile Plot 



Histograms and density plots (Chapter 7) provide the most intuitive visualizations of 
a distribution, but both require arbitrary parameter choices and can be misleading. 
Cumulative densities and quantile-quantile (q-q) plots (Chapter 8) always represent 
the data faithfully but can be more difficult to interpret. 

Boxplots Violins Strip Charts Sina Plots

Stacked Histograms Overlapping Densities Ridgeline Plot



Boxplots, violin plots, strip charts, and sina plots are useful when we want to visualize 
many distributions at once and/or if we are primarily interested in overall shifts 
among the distributions (see “Visualizing Distributions Along the Vertical Axis” on 
page 81). Stacked histograms and overlapping densities allow a more in-depth comparison
of a smaller number of distributions, though stacked histograms can be difficult
to interpret and are best avoided (see “Visualizing Multiple Distributions at the
Same Time” on page 64). Ridgeline plots can be a useful alternative to violin plots and 
are often useful when visualizing very large numbers of distributions or changes in 
distributions over time (see “Visualizing Distributions Along the Horizontal Axis” on 
page 88). 

Proportions 




Proportions can be visualized as pie charts, side-by-side bars, or stacked bars (Chapter 10).
As for amounts, when we visualize proportions with bars, the bars can be
arranged either vertically or horizontally. Pie charts emphasize that the individual
parts add up to a whole and highlight simple fractions. However, the individual pieces
are more easily compared in side-by-side bars. Stacked bars look awkward for a single
set of proportions, but can be useful when comparing multiple sets of proportions. 


Multiple Pie Charts Grouped Bars Stacked Bars Stacked Densities



When visualizing multiple sets of proportions or changes in proportions across conditions,
pie charts tend to be space-inefficient and often obscure relationships. Grouped
bars work well as long as the number of conditions compared is moderate, and
stacked bars can work for large numbers of conditions. Stacked densities (Chapter 10)
are appropriate when the proportions change along a continuous variable.


Mosaic Plot 




Parallel Sets 



When proportions are specified according to multiple grouping variables, mosaic 
plots, treemaps, or parallel sets are useful visualization approaches (Chapter 11). 
Mosaic plots assume that every level of one grouping variable can be combined with 
every level of another grouping variable, whereas treemaps do not make such an 
assumption. Treemaps work well even if the subdivisions of one group are entirely 
distinct from the subdivisions of another. Parallel sets work better than either mosaic 
plots or treemaps when there are more than two grouping variables. 

x-y relationships

Scatterplots (Chapter 12) represent the archetypical visualization when we want to
show one quantitative variable relative to another. If we have three quantitative variables,
we can map one onto the dot size, creating a variant of the scatterplot called a
bubble chart. For paired data, where the variables along the x and y axes are measured
in the same units, it is generally helpful to add a line indicating x = y (see
“Paired Data” on page 127). Paired data can also be shown as a slopegraph of paired
points connected by straight lines.


Density Contours 



2D Bins 

For large numbers of points, regular scatterplots can become uninformative due to
overplotting. In this case, contour lines, 2D bins, or hex bins may provide an alternative
(Chapter 18). When we want to visualize more than two quantities, on the other
hand, we may choose to plot correlation coefficients in the form of a correlogram 
instead of the underlying raw data (see “Correlograms” on page 121). 

Line Graph


Connected Scatterplot 


Smooth Line Graph 



When the x axis represents time or a strictly increasing quantity such as a treatment
dose, we commonly draw line graphs (Chapter 13). If we have a temporal sequence of
two response variables we can draw a connected scatterplot, where we first plot the
two response variables in a scatterplot and then connect dots corresponding to
adjacent time points (see “Time Series of Two or More Response Variables” on page 138).
We can use smooth lines to represent trends in a larger dataset (Chapter 14).

Geospatial Data 



Choropleth 



Cartogram 


Cartogram Heatmap 

The primary mode of showing geospatial data is in the form of a map (Chapter 15). A
map takes coordinates on the globe and projects them onto a flat surface, such that
shapes and distances on the globe are approximately represented by shapes and
distances in the 2D representation. In addition, we can show data values in different
regions by coloring those regions in the map according to the data. Such a map is
called a choropleth (see “Choropleth Mapping” on page 172). In some cases, it may be
helpful to distort the different regions according to some other quantity (e.g.,
population number) or simplify each region into a square. Such visualizations are
called cartograms (see “Cartograms” on page 176).



Uncertainty 


Error Bars Error Bars 2D Error Bars Graded Error Bars 



Error bars are meant to indicate the range of likely values for some estimate or
measurement. They extend horizontally and/or vertically from some reference point
representing the estimate or measurement (Chapter 16). Reference points can be shown
in various ways, such as by dots or by bars. Graded error bars show multiple ranges at 
the same time, where each range corresponds to a different degree of confidence. 
They are in effect multiple error bars with different line thicknesses plotted on top of 
each other. 

Confidence Strips Eyes Half-Eyes Quantile Dot Plot 



To achieve a more detailed visualization than is possible with error bars or graded
error bars, we can visualize the actual confidence or posterior distributions
(Chapter 16). Confidence strips provide a visual sense of uncertainty but are difficult to
read accurately. Eyes and half-eyes combine error bars with approaches to visualize
distributions (violins and ridgelines, respectively), and thus show both precise ranges
for some confidence levels and the overall uncertainty distribution. A quantile dot
plot can serve as an alternative visualization of an uncertainty distribution (see
“Framing Probabilities as Frequencies” on page 181). Because it shows the
distribution in discrete units, the quantile dot plot is not as precise but can be easier to read
than the continuous distribution shown by a violin or ridgeline plot.

For smooth line graphs, the equivalent of an error bar is a confidence band (see
“Visualizing the Uncertainty of Curve Fits” on page 197). It shows a range of values the
line might pass through at a given confidence level. As with error bars, we can draw
graded confidence bands that show multiple confidence levels at once. We can also
show individual fitted draws in lieu of or in addition to the confidence bands.

CHAPTER 6


Visualizing Amounts 


In many scenarios, we are interested in the magnitude of some set of numbers. For 
example, we might want to visualize the total sales volume of different brands of cars, 
or the total number of people living in different cities, or the age of Olympians
performing different sports. In all these cases, we have a set of categories (e.g., brands of
cars, cities, or sports) and a quantitative value for each category. I refer to these cases 
as visualizing amounts, because the main emphasis in these visualizations will be on 
the magnitude of the quantitative values. The standard visualization in this scenario 
is the bar plot, which has several variations, including simple bars as well as grouped 
and stacked bars. Alternatives to the bar plot are the dot plot and the heatmap. 

Bar Plots 

To motivate the concept of a bar plot, consider the total ticket sales for the most
popular movies on a given weekend. Table 6-1 shows the top five highest-grossing films
for the weekend before Christmas in 2017. Star Wars: The Last Jedi was by far the 
most popular movie on that weekend, outselling the fourth- and fifth-ranked movies, 
The Greatest Showman and Ferdinand, by almost a factor of 10. 

Table 6-1. Highest-grossing movies for the weekend of December 22-24, 2017. Data source: 
Box Office Mojo. Used with permission. 


Rank  Title                           Weekend gross
1     Star Wars: The Last Jedi          $71,565,498
2     Jumanji: Welcome to the Jungle    $36,169,328
3     Pitch Perfect 3                   $19,928,525
4     The Greatest Showman               $8,805,843
5     Ferdinand                          $7,316,746

This kind of data is commonly visualized with vertical bars. For each movie, we draw
a bar that starts at zero and extends all the way to the dollar value for that movie’s
weekend gross (Figure 6-1). This visualization is called a bar plot or bar chart.
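
As a rough illustration of the rule that every bar starts at zero and has a length proportional to its value, the following Python sketch renders the data from Table 6-1 as a text-only bar chart. The function name `bar_length` and the 40-character width are arbitrary choices made for this example; they do not come from any plotting library.

```python
# Weekend gross in US dollars, copied from Table 6-1.
weekend_gross = {
    "Star Wars: The Last Jedi": 71_565_498,
    "Jumanji: Welcome to the Jungle": 36_169_328,
    "Pitch Perfect 3": 19_928_525,
    "The Greatest Showman": 8_805_843,
    "Ferdinand": 7_316_746,
}


def bar_length(value, max_value, width=40):
    """Length in characters of a bar that starts at zero.

    The length is proportional to the value, so a movie that grossed
    half as much as the top movie gets a bar half as long.
    """
    return round(width * value / max_value)


top = max(weekend_gross.values())
for title, gross in weekend_gross.items():
    print(f"{title:<32}{'#' * bar_length(gross, top)} ${gross:,}")
```

Because the bars are anchored at zero, the ratio of two bar lengths matches the ratio of the underlying amounts, up to rounding to whole characters.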

Figure 6-1. Highest-grossing movies for the weekend of December 22-24, 2017, displayed 
as a bar plot. Data source: Box Office Mojo. Used with permission. 

One problem we commonly encounter with vertical bars is that the labels identifying 
each bar take up a lot of horizontal space. In fact, I had to make Figure 6-1 fairly wide 
and space out the bars so that I could place the movie titles underneath. To save
horizontal space, we could place the bars closer together and rotate the labels
(Figure 6-2). However, I am not a big proponent of rotated labels. I find the resulting 
plots awkward and difficult to read. And, in my experience, whenever the labels are 
too long to place horizontally, they also don’t look good rotated. 

Figure 6-2. Highest-grossing movies for the weekend of December 22-24, 2017, displayed
as a bar plot with rotated axis tick labels. Rotated axis tick labels tend to be difficult to 
read and require awkward space use underneath the plot. For these reasons, I generally 
consider plots with rotated tick labels to be ugly. Data source: Box Office Mojo. Used 
with permission. 


The better solution for long labels is usually to swap the x and y axes, so that the bars 
run horizontally (Figure 6-3). After swapping the axes, we obtain a compact figure in 
which all visual elements, including all text, are horizontally oriented. As a result, the 
figure is much easier to read than Figure 6-2 or even Figure 6-1. 

Figure 6-3. Highest-grossing movies for the weekend of December 22-24, 2017, displayed
as a horizontal bar plot. Data source: Box Office Mojo. Used with permission. 

Regardless of whether we place bars vertically or horizontally, we need to pay
attention to the order in which the bars are arranged. I often see bar plots where the bars
are arranged arbitrarily or by some criterion that is not meaningful in the context of 
the figure. Some plotting programs arrange bars by default in alphabetical order of 
the labels, and other similarly arbitrary arrangements are possible (Figure 6-4). In 
general, the resulting figures are more confusing and less intuitive than figures where 
bars are arranged in order of their size. 

We should only rearrange bars, however, when there is no natural ordering to the
categories the bars represent. Whenever there is a natural ordering (i.e., when our
categorical variable is an ordered factor), we should retain that ordering in the
visualization. For example, Figure 6-5 shows the median annual income in the US by
age groups. In this case, the bars should be arranged in order of increasing age.
Sorting by bar height while shuffling the age groups makes no sense (Figure 6-6).

Figure 6-4. Highest-grossing movies for the weekend of December 22-24, 2017, displayed
as a horizontal bar plot. Here, the bars have been placed in descending order of the
lengths of the movie titles. This arrangement of bars is arbitrary, doesn’t serve a
meaningful purpose, and makes the resulting figure much less intuitive than Figure 6-3. Data
source: Box Office Mojo. Used with permission.

Figure 6-5. 2016 median US annual household income versus age group. The
45-to-54-year age group has the highest median income. Data source: US Census Bureau.

Figure 6-6. 2016 median US annual household income versus age group, sorted by
income. While this order of bars looks visually appealing, the order of the age groups is
now confusing. Data source: US Census Bureau.



Pay attention to the bar order. If the bars represent unordered categories, order
them by ascending or descending data values.
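
The ordering rule can be written down in a few lines. The sketch below is only one way to express it: the helper `arrange_bars` and its `natural_order` argument are names invented for this example, and the values for the ordered case are made-up placeholders, not the Census Bureau figures behind Figure 6-5.

```python
def arrange_bars(values, natural_order=None):
    """Return (label, value) pairs in the order the bars should be drawn.

    values        -- dict mapping category label to amount
    natural_order -- optional list of labels; when given, the categories
                     are treated as ordered and this order is kept
    """
    if natural_order is not None:
        return [(label, values[label]) for label in natural_order]
    # Unordered categories: largest value first.
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


# Unordered categories: movies from Table 6-1, deliberately listed alphabetically.
movies = {
    "Ferdinand": 7_316_746,
    "Jumanji: Welcome to the Jungle": 36_169_328,
    "Pitch Perfect 3": 19_928_525,
    "Star Wars: The Last Jedi": 71_565_498,
    "The Greatest Showman": 8_805_843,
}
print([label for label, _ in arrange_bars(movies)])

# Ordered categories: labels and numbers are hypothetical placeholders.
groups = ["youngest", "middle", "oldest"]
placeholder = {"middle": 3.0, "oldest": 2.0, "youngest": 1.0}
print(arrange_bars(placeholder, natural_order=groups))
```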


Grouped and Stacked Bars 

All the examples from the previous section showed how a quantitative amount varied 
with respect to one categorical variable. Frequently, however, we are interested in two 
categorical variables at the same time. For example, the US Census Bureau provides 
median income levels broken down by both age and race. We can visualize this
dataset with a grouped bar plot (Figure 6-7). In a grouped bar plot, we draw a group of
bars at each position along the x axis, determined by one categorical variable, and 
then we draw bars within each group according to the other categorical variable. 

Grouped bar plots show a lot of information at once, and they can be confusing. In
fact, even though I have not labeled Figure 6-7 as bad or ugly, I find it difficult to
read. In particular, it is difficult to compare median incomes across age groups for a
given racial group. So, this figure is only appropriate if we are primarily interested in
the differences in income levels among racial groups, separately for specific age
groups. If we care more about the overall pattern of income levels among racial
groups, it may be preferable to show race along the x axis and show ages as distinct
bars within each racial group (Figure 6-8).

Figure 6-7. 2016 median US annual household income versus age group and race. Age
groups are shown along the x axis, and for each age group there are four bars,
corresponding to the median income of Asian, white, Hispanic, and black people, respectively.
Data source: US Census Bureau.

Figure 6-8. 2016 median US annual household income versus age group and race. In
contrast to Figure 6-7, now race is shown along the x axis, and for each race we show
seven bars according to the seven age groups. Data source: US Census Bureau.

Both Figures 6-7 and 6-8 encode one categorical variable by position along the x axis
and the other by bar color. And in both cases, the encoding by position is easy to read
while the encoding by bar color requires more mental effort, as we have to mentally
match the colors of the bars against the colors in the legend. We can avoid this added
mental effort by showing four separate regular bar plots rather than one grouped bar
plot (Figure 6-9). Which of these various options we choose is ultimately a matter of
taste. I would likely choose Figure 6-9, because it circumvents the need for different
bar colors.

Figure 6-9. 2016 median US annual household income versus age group and race.
Instead of displaying this data as a grouped bar plot, as in Figures 6-7 and 6-8, we now
show the data as four separate regular bar plots. This choice has the advantage that we
don’t need to encode either categorical variable by bar color. Data source: US Census
Bureau.

Instead of drawing groups of bars side-by-side, it is sometimes preferable to stack
bars on top of each other. Stacking is useful when the sum of the amounts
represented by the individual stacked bars is in itself a meaningful amount. So, while it
would not make sense to stack the median income values of Figure 6-7 (the sum of
two median income values is not a meaningful value), it might make sense to stack
the weekend gross values of Figure 6-1 (the sum of the weekend gross values of two
movies is the total gross for the two movies combined). Stacking is also appropriate
when the individual bars represent counts. For example, in a dataset of people, we can
either count men and women separately or we can count them together. If we stack a
bar representing a count of women on top of a bar representing a count of men, then
the combined bar height represents the total count of people regardless of gender.
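
To make the arithmetic of stacking concrete, here is a minimal sketch that stacks the five weekend gross values from Table 6-1 into one bar. Each segment starts where the previous one ends, so the top of the last segment is the combined gross. The helper name `stack` is introduced here purely for illustration.

```python
from itertools import accumulate

# Weekend gross in US dollars for the five movies in Table 6-1, ranks 1 to 5.
gross = [71_565_498, 36_169_328, 19_928_525, 8_805_843, 7_316_746]


def stack(amounts):
    """Return (bottom, top) for each segment of a single stacked bar."""
    tops = list(accumulate(amounts))      # running totals
    bottoms = [0] + tops[:-1]             # each segment starts at the previous top
    return list(zip(bottoms, tops))


segments = stack(gross)
for bottom, top in segments:
    print(f"segment from {bottom:>11,} to {top:>11,}")
print(f"total height: {segments[-1][1]:,}")
```

The same running-total logic applies to counts, for example a segment for women placed on top of a segment for men.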

I will demonstrate this principle using a dataset about the passengers of the
transatlantic ocean liner Titanic, which sank on April 15, 1912. On board were
approximately 1,300 passengers, not counting crew. The passengers were traveling in one of
three classes (first, second, or third), and there were almost twice as many male as
female passengers on the ship. To visualize the breakdown of passengers by class and
gender, we can draw separate bars for each class and gender and stack the bars
representing women on top of the bars representing men, separately for each class
(Figure 6-10). The combined bars represent the total number of passengers in each
class.

Figure 6-10. Numbers of female and male passengers on the Titanic traveling in 1st, 2nd,
and 3rd class. Data source: Encyclopedia Titanica. 

Figure 6-10 differs from the previous bar plots I have shown in that there is no 
explicit y axis. I have instead shown the actual numerical values that each bar
represents. Whenever a plot is meant to display only a small number of different values, it
makes sense to add the actual numbers to the plot. This substantially increases the 
amount of information conveyed by the plot without adding much visual noise, and 
it removes the need for an explicit y axis. 

Dot Plots and Heatmaps 

Bars are not the only option for visualizing amounts. One important limitation of
bars is that they need to start at zero, so that the bar length is proportional to the
amount shown. For some datasets, this can be impractical or may obscure key
features. In this case, we can indicate amounts by placing dots at the appropriate
locations along the x or y axis.

Figure 6-11 demonstrates this visualization approach for a dataset of life expectancies 
in 25 countries in the Americas. The citizens of these countries have life expectancies 
between 60 and 81 years, and each individual life expectancy value is shown with a 
blue dot at the appropriate location along the x axis. By limiting the axis range to the 
interval from 60 to 81 years, the figure highlights the key features of this dataset:
Canada has the highest life expectancy among all listed countries, and Bolivia and Haiti
have much lower life expectancies than all other countries. If we had used bars
instead of dots (Figure 6-12), we’d have made a much less compelling figure. Because
the bars are so long in this figure, and they all have nearly the same length, the eye is 
drawn to the middle of the bars rather than to their endpoints, and the figure fails to 
convey its message. 
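
A small calculation shows why the bars look so similar. The sketch below compares how much of the axis a zero-based bar covers with where a dot lands on an axis limited to the data range. It uses only the two extreme values quoted in the text (60 and 81 years); the function names are invented for this example.

```python
def bar_fraction(value, axis_max):
    """Fraction of the axis covered by a bar that starts at zero."""
    return value / axis_max


def dot_position(value, axis_min, axis_max):
    """Relative position of a dot on a limited axis (0 = left edge, 1 = right edge)."""
    return (value - axis_min) / (axis_max - axis_min)


lowest, highest = 60, 81  # range of life expectancies quoted in the text

# Bars: even the shortest bar is about three quarters as long as the longest.
print(round(bar_fraction(lowest, highest), 2), bar_fraction(highest, highest))

# Dots: the same two values sit at opposite ends of the axis.
print(dot_position(lowest, lowest, highest), dot_position(highest, lowest, highest))
```

With bars, the full 21-year spread is squeezed into roughly the last quarter of the bar length; with dots on a limited axis, it uses the whole width of the plot.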

Figure 6-11. Life expectancies of countries in the Americas, for the year 2007. Data
source: Gapminder.

Figure 6-12. Life expectancies of countries in the Americas, for the year 2007, shown as
bars. This dataset is not suitable for being visualized with bars. The bars are too long
and they draw attention away from the key feature of the data, the differences in life
expectancy among the different countries. Data source: Gapminder.


Regardless of whether we use bars or dots, however, we need to pay attention to the 
ordering of the data values. In Figures 6-11 and 6-12, the countries are ordered in 
descending order of life expectancy. If we instead ordered them alphabetically, we’d
end up with a disordered cloud of points that is confusing and fails to convey a clear 
message (Figure 6-13). 

Figure 6-13. Life expectancies of countries in the Americas, for the year 2007. Here, the
countries are ordered alphabetically, which causes the dots to form a disordered cloud of
points. This makes the figure difficult to read, and therefore it deserves to be labeled as
“bad.” Data source: Gapminder.

All the examples so far have represented amounts by location along a position scale, 
either through the endpoint of a bar or the placement of a dot. For very large datasets, 
neither of these options may be appropriate, because the resulting figure would 
become too busy. We already saw in Figure 6-7 that just seven groups of four data 
values can result in a figure that is complex and not that easy to read. If we had 20 
groups of 20 data values, a similar figure would likely be quite confusing. 

As an alternative to mapping data values onto positions via bars or dots, we can map 
data values onto colors. Such a figure is called a heatmap. Figure 6-14 uses this 
approach to show the percentage of internet users over time in 20 countries and for 
23 years, from 1994 to 2016. While this visualization makes it harder to determine the 
exact data values shown (e.g., what’s the exact percentage of internet users in the
United States in 2015?), it does an excellent job of highlighting broader trends. We can
see in which countries internet use began early and in which it did not, and we can
also see which countries have high internet penetration in the final year covered by
the dataset (2016).
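
Mapping a data value onto a color usually means interpolating along a color scale. The following sketch shows one simple possibility, a linear blend from white to dark blue. The two endpoint colors and the function names are assumptions made for this example; they are not the color scale used in Figure 6-14.

```python
def lerp_color(fraction, low=(255, 255, 255), high=(0, 0, 128)):
    """Linearly interpolate between two RGB colors; fraction is clamped to [0, 1]."""
    fraction = min(1.0, max(0.0, fraction))
    return tuple(round(a + fraction * (b - a)) for a, b in zip(low, high))


def percent_to_hex(percent):
    """Map a percentage (0 to 100) to a hex color string for one heatmap cell."""
    r, g, b = lerp_color(percent / 100)
    return f"#{r:02x}{g:02x}{b:02x}"


for percent in (0, 25, 50, 75, 100):
    print(percent, percent_to_hex(percent))
```

Reading an exact percentage back from such a color is hard, which is the trade-off described above: the heatmap gives up precision for an overview of broad trends.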

Figure 6-14. Internet adoption over time, for select countries. Color represents the
percent of internet users for the respective country and year. Countries were ordered by
percent internet users in 2016. Data source: World Bank.

As is the case with all other visualization approaches discussed in this chapter, we 
need to pay attention to the ordering of the categorical data values when making 
heatmaps. In Figure 6-14, countries are ordered by the percentage of internet users in 
2016. This ordering places the United Kingdom, Japan, Canada, and Germany above 
the United States, because all these countries had higher internet penetration in 2016 
than the United States, even though the United States saw significant internet use at 
an earlier time. Alternatively, we could order countries by how early they started to 
see significant internet usage. In Figure 6-15, countries are ordered by the year in 
which internet usage first rose to above 20%. In this figure, the United States falls into 
the third position from the top, and it stands out for having relatively low internet 
usage in 2016 compared to how early internet usage started there. A similar pattern 
can be seen for Italy. Israel and France, by contrast, started relatively late but gained 
ground rapidly. 
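
The two orderings discussed here correspond to two different sort keys. The sketch below computes both for a tiny made-up dataset: the countries "A", "B", and "C" and their percentages are hypothetical and are not taken from the World Bank data. The helper names are likewise invented for this example.

```python
def first_year_above(series, threshold=20.0):
    """Return the first year whose value exceeds the threshold, or None."""
    for year in sorted(series):
        if series[year] > threshold:
            return year
    return None


def final_value(series):
    """Value in the last year covered by the series."""
    return series[max(series)]


def onset_key(series):
    """Sort key: early adopters first, countries that never cross the threshold last."""
    year = first_year_above(series)
    return (year is None, year)


# Hypothetical percentages of internet users by year.
usage = {
    "A": {2000: 25.0, 2008: 60.0, 2016: 75.0},
    "B": {2000: 5.0, 2008: 55.0, 2016: 90.0},
    "C": {2000: 10.0, 2008: 15.0, 2016: 40.0},
}

by_final = sorted(usage, key=lambda c: final_value(usage[c]), reverse=True)
by_onset = sorted(usage, key=lambda c: onset_key(usage[c]))
print(by_final)
print(by_onset)
```

In this toy example, country "A" drops from first place under the onset ordering to second place under the final-year ordering, the same kind of shift the text describes for the United States.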

Figure 6-15. Internet adoption over time, for select countries. Countries were ordered by
the year in which their internet usage first exceeded 20%. Data source: World Bank. 

Both Figures 6-14 and 6-15 are valid representations of the data. Which one is
preferred depends on the story we want to convey. If our story is about internet usage in
2016, then Figure 6-14 is probably the better choice. If, however, our story is about 
how early or late adoption of the internet relates to current-day usage, then 
Figure 6-15 is preferable. 

CHAPTER 7


Visualizing Distributions: Histograms and Density Plots


We frequently encounter the situation where we would like to understand how a
particular variable is distributed in a dataset. To give a concrete example, we will
consider the passengers of the Titanic, a dataset we encountered in Chapter 6. There were
approximately 1,300 passengers on the Titanic (not counting crew), and we have 
reported ages for 756 of them. We might want to know how many passengers of what 
ages there were on the Titanic, i.e., how many children, young adults, middle-aged 
people, seniors, and so on. We call the relative proportions of different ages among 
the passengers the age distribution of the passengers. 

Visualizing a Single Distribution 

We can obtain a sense of the age distribution among the passengers by grouping all 
passengers into bins with comparable ages and then counting the number of
passengers in each bin. This procedure results in a table such as Table 7-1.

Table 7-1. Numbers of passengers with known age on the Titanic. 


Age range  Count    Age range  Count    Age range  Count
0-5           36    31-35         76    61-65         16
6-10          19    36-40         74    66-70          3
11-15         18    41-45         54    71-75          3
16-20         99    46-50         50
21-25        139    51-55         26
26-30        121    56-60         22

We can visualize this table by drawing filled rectangles whose heights correspond to
the counts and whose widths correspond to the width of the age bins (Figure 7-1). 
Such a visualization is called a histogram. (Note that all bins must have the same 
width for the visualization to be a valid histogram.) 
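
The binning step behind a histogram can be sketched in a few lines of Python. The function `histogram_counts` is a name chosen for this example, the ages are hypothetical, and the bins are taken as half-open intervals that include their lower edge, which is one common convention but not necessarily the one used for Table 7-1. The sketch also assumes that no value lies below `start`.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width, start=0.0):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w).

    Returns a list of (lower_edge, upper_edge, count) tuples, including
    empty bins, so that every rectangle has the same width.
    """
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    n_bins = max(counts) + 1
    return [
        (start + k * bin_width, start + (k + 1) * bin_width, counts.get(k, 0))
        for k in range(n_bins)
    ]


ages = [2, 4, 7, 21, 22, 23, 24, 29, 35, 41]  # hypothetical ages
for lower, upper, count in histogram_counts(ages, bin_width=10):
    print(f"{lower:>4.0f} to {upper:<4.0f} {'#' * count}")
```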

Figure 7-1. Histogram of the ages of Titanic passengers. Data source: Encyclopedia
Titanica. 

Because histograms are generated by binning the data, their exact visual appearance 
depends on the choice of the bin width. Most visualization programs that generate 
histograms will choose a bin width by default, but chances are that bin width is not 
the most appropriate one for any histogram you may want to make. It is therefore 
critical to always try different bin widths to verify that the resulting histogram reflects 
the underlying data accurately. In general, if the bin width is too small, then the
histogram becomes overly peaky and visually busy and the main trends in the data may be
obscured. On the other hand, if the bin width is too large, then smaller features in the 
distribution of the data, such as the dip around age 10 in this example, may disappear. 

For the age distribution of Titanic passengers, we can see that a bin width of 1 year is 
too small and a bin width of 15 years is too large, whereas bin widths between 3 and
5 years work fine (Figure 7-2).
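
We can reproduce the effect of a too-large bin width directly from Table 7-1 by merging every three adjacent 5-year bins into one 15-year bin. The helper `merge_bins` is a name introduced for this sketch.

```python
# Counts from Table 7-1, for the 5-year bins from ages 0-5 up to 71-75.
counts_5yr = [36, 19, 18, 99, 139, 121, 76, 74, 54, 50, 26, 22, 16, 3, 3]


def merge_bins(counts, factor):
    """Combine every `factor` adjacent bins into one wider bin."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


counts_15yr = merge_bins(counts_5yr, 3)
print(counts_5yr)
print(counts_15yr)
```

In the 5-year bins, the counts fall from 36 to 19 and 18 before rising sharply, which is the dip around age 10. After merging, the first 15-year bin holds 73 passengers and the dip is no longer visible.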

Figure 7-2. Histograms depend on the chosen bin width. Here, the same age distribution
of Titanic passengers is shown with four different bin widths: (a) 1 year; (b) 3 years; (c) 
5 years; (d) 15 years. Data source: Encyclopedia Titanica. 



When making a histogram, always explore multiple bin widths. 


Histograms have been a popular visualization option since at least the 18th century, 
in part because they are easily generated by hand. More recently, as extensive
computing power has become available in everyday devices such as laptops and cell
phones, we see them increasingly being replaced by density plots. In a density plot, we 
attempt to visualize the underlying probability distribution of the data by drawing an 
appropriate continuous curve (Figure 7-3). This curve needs to be estimated from the 
data, and the most commonly used method for this estimation procedure is called 
kernel density estimation. In kernel density estimation, we draw a continuous curve 
(the kernel) with a small width (controlled by a parameter called bandwidth) at the 
location of each data point, and then we add up all these curves to obtain the final 
density estimate. The most widely used kernel is a Gaussian kernel (i.e., a Gaussian 
bell curve), but there are many other choices. 
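
The procedure just described translates almost literally into code. The following is a minimal sketch of a Gaussian kernel density estimate in plain Python, not the implementation used by any particular plotting package. The ages are hypothetical, and dividing by the number of points and the bandwidth is what makes the area under the summed curves equal 1.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian bell curve with unit area."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, bandwidth):
    """Density estimate at x: sum of one kernel per data point, normalized to unit area."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)


ages = [2, 4, 7, 21, 22, 23, 24, 29, 35, 41]  # hypothetical ages

# Evaluate the curve on a grid from -20 to 80 and check the area numerically.
step = 0.5
grid = [i * step for i in range(-40, 161)]
area = sum(kde(x, ages, bandwidth=2.0) for x in grid) * step
print(round(area, 3))
```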

Figure 7-3. Kernel density estimate of the age distribution of passengers on the Titanic.
The height of the curve is scaled such that the area under the curve equals 1. The density 
estimate was performed with a Gaussian kernel and a bandwidth of 2. Data source: 
Encyclopedia Titanica. 

Just as is the case with histograms, the exact visual appearance of a density plot 
depends on the kernel and bandwidth choices (Figure 7-4). The bandwidth
parameter behaves similarly to the bin width in histograms. If the bandwidth is too small,
then the density estimate can become overly peaky and visually busy and the main 
trends in the data may be obscured. On the other hand, if the bandwidth is too large, 
then smaller features in the distribution of the data may disappear. In addition, the 
choice of the kernel affects the shape of the density curve. For example, a Gaussian 
kernel will have a tendency to produce density estimates that look Gaussian-like, with 
smooth features and tails. By contrast, a rectangular kernel can generate the
appearance of steps in the density curve (Figure 7-4d). In general, the more data points
there are in the dataset, the less the choice of the kernel matters. Therefore, density 
plots tend to be quite reliable and informative for large datasets but can be misleading 
for datasets of only a few points. 
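
To see where the steps come from, we can swap the kernel in a simple density estimator. The sketch below is an illustration under one simplifying assumption: the bandwidth is used as the half-width of the rectangle, whereas some software instead defines the bandwidth as the standard deviation of the kernel. The two data points are hypothetical.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def rectangular_kernel(u):
    """Uniform kernel: constant height 0.5 on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def kde(x, data, bandwidth, kernel):
    """Density estimate at x for an arbitrary kernel with unit area."""
    return sum(kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)


data = [10, 11]  # two hypothetical data points
for x in (7.5, 8.5, 9.5):
    print(
        x,
        kde(x, data, 2.0, rectangular_kernel),
        round(kde(x, data, 2.0, gaussian_kernel), 4),
    )
```

With the rectangular kernel, the estimate jumps from 0 to 0.125 to 0.25 as x moves into the reach of first one and then both data points. The Gaussian estimate changes smoothly and is positive everywhere.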

Figure 7-4. Kernel density estimates depend on the chosen kernel and bandwidth. Here,
the same age distribution of Titanic passengers is shown for four different combinations
of these parameters: (a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel,
bandwidth = 2; (c) Gaussian kernel, bandwidth = 5; (d) rectangular kernel, bandwidth = 2.
Data source: Encyclopedia Titanica. 


Density curves are usually scaled such that the area under the curve equals 1. This 
convention can make the y axis scale confusing, because it depends on the units of the 
x axis. For example, in the case of the age distribution, the data range on the x axis 
goes from 0 to approximately 75. Therefore, we expect the mean height of the density 
curve to be 1/75 ≈ 0.013. Indeed, when looking at the age density curves (e.g.,
Figure 7-4), we see that the y values range from 0 to approximately 0.04, with an
average of somewhere close to 0.01.
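
The expected mean height follows directly from the normalization. Writing f for the density estimate and [a, b] for the range of the data (symbols introduced here only for this illustration), and assuming that essentially all of the area under the curve lies within that range:

```latex
\[
\bar{f} \;=\; \frac{1}{b-a}\int_a^b f(x)\,dx
\;\approx\; \frac{1}{b-a}
\;=\; \frac{1}{75-0}
\;\approx\; 0.013 .
\]
```

If the ages were measured in months instead of years, the range would be 12 times wider and the typical height of the curve 12 times smaller, which is why the y axis scale depends on the units of the x axis.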

Kernel density estimates have one pitfall that we need to be aware of: they have a
tendency to produce the appearance of data where none exists, in particular in the tails.
As a consequence, careless use of density estimates can easily lead to figures that
make nonsensical statements. For example, if we don’t pay attention, we might
generate a visualization of an age distribution that includes negative ages (Figure 7-5).

Figure 7-5. Kernel density estimates can extend the tails of the distribution into areas
where no data exists and no data is even possible. Here, the density estimate for ages of
Titanic passengers has been allowed to extend into the negative age range. This is
nonsensical and should be avoided. Data source: Encyclopedia Titanica.



Always verify that your density estimate does not predict the existence of
nonsensical data values.
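
One way to run such a check, sketched here for a Gaussian kernel only, is to compute how much of the estimated density falls below a bound that the data cannot cross. The function `mass_below` and the example ages are hypothetical; the calculation relies on the fact that each Gaussian kernel contributes its own normal cumulative probability.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mass_below(bound, data, bandwidth):
    """Fraction of a Gaussian kernel density estimate that lies below `bound`."""
    return sum(normal_cdf((bound - xi) / bandwidth) for xi in data) / len(data)


ages = [1, 3, 30, 40]  # hypothetical ages, two of them close to zero
leak = mass_below(0.0, ages, bandwidth=2.0)
print(f"{leak:.1%} of the estimated density lies at negative ages")
```

A single data point at age 1 with a bandwidth of 2 already places about 31% of its kernel below zero, so datasets with many young children are particularly prone to this problem.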


So should you choose a histogram or a density plot to visualize a distribution? Heated 
discussions can be had on this topic. Some people are vehemently against density 
plots and believe that they are arbitrary and misleading. Others realize that
histograms can be just as arbitrary and misleading. I think the choice is largely a matter of
taste, but sometimes one or the other option may more accurately reflect the specific
features of interest in the data at hand. There is also the possibility of using neither
and instead choosing empirical cumulative distribution functions or q-q plots (Chapter 8).
However, I believe that density estimates have an inherent advantage over histograms 
as soon as we want to visualize more than one distribution at a time. 

Visualizing Multiple Distributions at the Same Time 

In many scenarios we have multiple distributions we would like to visualize
simultaneously. For example, let’s say we’d like to see how the ages of Titanic passengers are
distributed between men and women. Were male and female passengers generally of
the same age, or was there an age difference between the genders? One commonly
employed visualization strategy in this case is a stacked histogram, where we draw the
histogram bars for women on top of the bars for men, in a different color
(Figure 7-6).



Figure 7-6. Histogram of the ages of Titanic passengers stratified by gender. This figure 
has been labeled as “bad” because stacked histograms are easily confused with
overlapping histograms (see Figure 7-7). In addition, the heights of the bars representing female
passengers cannot easily be compared to each other. Data source: Encyclopedia Titanica. 


In my opinion, this type of visualization should be avoided. There are two key
problems here. First, from just looking at the figure, it is never entirely clear where exactly
the bars begin. Do they start where the color changes or are they meant to start at 
zero? In other words, are there about 25 females of age 18-20, or are there almost 80? 
(The former is the case.) Second, the bar heights for the female counts cannot be 
directly compared to each other, because the bars all start at a different height. For 
example, the men were on average older than the women, and this fact is not at all 
visible in Figure 7-6. 

We could try to address these problems by having all bars start at zero and making 
the bars partially transparent (Figure 7-7). 

Figure 7-7. Age distributions of male and female Titanic passengers, shown as two
overlapping histograms. This figure has been labeled as “bad” because there is no clear visual
indication that all blue bars start at a count of 0. Data source: Encyclopedia Titanica.


However, this approach generates new problems. Now it appears that there are 
actually three different groups, not just two, and we’re still not entirely sure where 
each bar starts and ends. Overlapping histograms don’t work well because a
semitransparent bar drawn on top of another tends to not look like a semitransparent bar
but instead like a bar drawn in a different color. 

Overlapping density plots don’t typically have the problem that overlapping
histograms have, because the continuous density lines help the eye keep the distributions
separate. However, for this particular dataset, the age distributions for male and 
female passengers are nearly identical up to around age 17 and then diverge, so that 
the resulting visualization is still not ideal (Figure 7-8). 

A solution that works well for this dataset is to show the age distributions of male and 
female passengers separately, each as a proportion of the overall age distribution 
(Figure 7-9). This visualization shows intuitively and clearly that there were many 
fewer women than men in the 20-to-50-year age range on the Titanic.

Figure 7-8. Density estimates of the ages of male and female Titanic passengers. To
highlight that there were more male than female passengers, the density curves were scaled
such that the area under each curve corresponds to the total number of male and female
passengers with known age (468 and 288, respectively). Data source: Encyclopedia
Titanica.

Figure 7-9. Age distributions of male and female Titanic passengers, shown as
proportions of the total number of passengers. The colored areas show the density estimates of
the ages of male and female passengers, respectively, and the gray areas show the overall
passenger age distribution. Data source: Encyclopedia Titanica.


Finally, when we want to visualize exactly two distributions, we can also make two 
separate histograms, rotate them by 90 degrees, and have the bars in one histogram 
point in the opposite direction of the other. This trick is commonly employed when 
visualizing age distributions, and the resulting plot is usually called an age pyramid 
(Figure 7-10).

Figure 7-10. The age distributions of male and female Titanic passengers visualized as
an age pyramid. Data source: Encyclopedia Titanica.


Importantly, this trick does not work when there are more than two distributions we
want to visualize at the same time. For multiple distributions, histograms tend to
become confusing, whereas density plots work well as long as the distributions are
somewhat distinct and contiguous. For example, to visualize the distribution of
butterfat percentage in the milk of cows from four different cattle breeds, density plots
are fine (Figure 7-11).



To visualize several distributions at once, kernel density plots will 
generally work better than histograms.

Figure 7-11. Density estimates of the butterfat percentage in the milk of four cattle
breeds. Data source: Canadian Record of Performance for Purebred Dairy Cattle.


CHAPTER 8


Visualizing Distributions: Empirical Cumulative Distribution Functions and Q-Q Plots


In Chapter 7, I described how we can visualize distributions with histograms or density
plots. Both of these approaches are intuitive and visually appealing. However, as
discussed in that chapter, they both share the limitation that the resulting figure
depends to a substantial degree on parameters the user has to choose, such as the bin
width for histograms and the bandwidth for density plots. As a result, both have to be
considered as an interpretation of the data rather than a direct visualization of the
data itself.

As an alternative to using histograms or density plots, we could simply show all the 
data points individually, as a point cloud. However, this approach becomes unwieldy 
for very large datasets, and in any case there is value in aggregate methods that highlight
properties of the distribution rather than the individual data points. To solve this
problem, statisticians have invented empirical cumulative distribution functions 
(ECDFs) and quantile-quantile (q-q) plots. These types of visualizations require no 
arbitrary parameter choices, and they show all of the data at once. Unfortunately, they 
are a little less intuitive than a histogram or a density plot is, and I don’t see them 
used frequently outside of highly technical publications. They are quite popular 
among statisticians, though, and I think anybody interested in data visualization 
should be familiar with these techniques. 

Empirical Cumulative Distribution Functions 

To illustrate ECDFs, I will begin with a hypothetical example that is closely modeled
after something I deal with a lot as a professor in the classroom: a dataset of student
grades. Assume our hypothetical class has 50 students, and the students just completed
an exam on which they could score between 0 and 100 points. How can we best
visualize the class’s performance, for example to determine appropriate grade
boundaries?

We can plot the total number of students that have received at most a certain number 
of points versus all possible point scores. This plot will be an ascending function, 
starting at 0 for 0 points and ending at 50 for 100 points. A different way of thinking 
about this visualization is the following: we can rank all students by the number of 
points they obtained, in ascending order (so the student with the fewest points 
receives the lowest rank and the student with the most points the highest), and then 
plot the rank versus the actual points obtained. The result is an empirical cumulative 
distribution function, or simply cumulative distribution. Each dot represents one student,
and the lines visualize the highest student rank observed for any possible point
value (Figure 8-1).
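
As a minimal sketch of the ranking procedure just described, the following Python function computes, for each distinct score, the highest rank observed at that score, which equals the number of students who received at most that many points. The function name `ecdf_ranks` and the six sample scores are hypothetical and chosen only for illustration; they are not the data behind Figure 8-1.

```python
def ecdf_ranks(scores):
    """Return (points, rank) pairs of the ascending ECDF.

    The rank at a given point value is the number of students who
    obtained at most that many points (tied students share the
    highest rank among them).
    """
    ordered = sorted(scores)
    steps = []
    for rank, points in enumerate(ordered, start=1):
        if steps and steps[-1][0] == points:
            steps[-1] = (points, rank)  # tie: keep the highest rank
        else:
            steps.append((points, rank))
    return steps


# Hypothetical exam scores for a class of six students.
example = ecdf_ranks([76, 80, 80, 91, 55, 80])
```

Plotting the rank against the points as a step function would give the ascending ECDF; three students tied at 80 points produce a single tall step.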



Figure 8-1. Empirical cumulative distribution function of student grades for a
hypothetical class of 50 students.

You may wonder what happens if we rank the students the other way round, in 
descending order. This ranking simply flips the function on its head. The result is still 
an empirical cumulative distribution function, but the lines now represent the lowest 
student rank observed for any possible point value (Figure 8-2). 

Ascending cumulative distribution functions are more widely known and more commonly
used than descending ones, but both have important applications. Descending
cumulative distribution functions are critical when we want to visualize highly
skewed distributions, as discussed in the next section.

Figure 8-2. Distribution of student grades plotted as a descending ECDF.

In practical applications, it is quite common to draw the ECDF without highlighting 
the individual points and to normalize the ranks by the maximum rank, so that the y 
axis represents the cumulative frequency (Figure 8-3). 



Figure 8-3. ECDF of student grades. The student ranks have been normalized to the total 
number of students, such that the y values plotted correspond to the fraction of students 
in the class with at most that many points.

We can directly read off key properties of the student grade distribution from this
plot. For example, approximately a quarter of the students (25%) received fewer than 75
points. The median point value (corresponding to a cumulative frequency of 0.5) is
81. Approximately 20% of the students received 90 points or more.
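
Readouts of this kind can be sketched in a few lines of Python. The helper names below are hypothetical, and the rule used to locate a cumulative frequency (the smallest observed score that reaches the requested level) is only one of several common conventions, so treat this as an illustration under those assumptions rather than the procedure used for Figure 8-3.

```python
def fraction_at_most(scores, points):
    """Ascending ECDF: fraction of students with at most `points`."""
    return sum(s <= points for s in scores) / len(scores)


def fraction_at_least(scores, points):
    """Descending ECDF: fraction of students with at least `points`."""
    return sum(s >= points for s in scores) / len(scores)


def smallest_score_reaching(scores, level):
    """Smallest observed score whose cumulative frequency is >= level."""
    for s in sorted(scores):
        if fraction_at_most(scores, s) >= level:
            return s
    raise ValueError("level must be between 0 and 1")


# Hypothetical scores; level 0.5 corresponds to the median.
sample = [55, 76, 80, 80, 80, 91, 95, 100]
median_score = smallest_score_reaching(sample, 0.5)
```

With these hypothetical scores, `fraction_at_least(sample, 90)` answers the question "what share of students received 90 points or more" in the same way the text reads it off the plot.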

I find ECDFs handy for assigning grade boundaries because they help me locate the 
exact cutoffs that minimize student unhappiness. In this example, there’s
a fairly long horizontal line right below 80 points, followed by a steep rise right at 80. 
This feature is caused by three students receiving 80 points on their exam while the 
student with the next highest grade received only 76 points. In this scenario, I might 
decide that everybody with a point score of 80 or more receives a B and everybody 
with 79 or less receives a C. The three students with 80 points are happy that they just 
made a B, and the student with 76 realizes that they would have had to perform much 
better to not receive a C. If I were to set the cutoff at 77, the distribution of letter 
grades would be exactly the same, but I might find the student with 76 points visiting 
my office hoping to negotiate their grade. Likewise, if I had set the cutoff at 81, I 
would likely have had three students in my office trying to negotiate their grade. 

Highly Skewed Distributions 

Many empirical datasets display highly skewed distributions, in particular with heavy 
tails to the right, and these distributions can be challenging to visualize. Examples of 
such distributions include the number of people living in different cities or counties, 
the number of contacts in a social network, the frequency with which individual 
words appear in a book, the number of academic papers written by different authors, 
the net worth of individuals, and the number of interaction partners of individual 
proteins in protein-protein interaction networks [Clauset, Shalizi, and Newman 
2009]. All these distributions have in common that their right tail decays more slowly than
an exponential function. In practice, this means that very large values are not that
rare, even if the mean of the distribution is small. Power-law distributions form an
important class of such distributions: the likelihood of observing a value that is
x times larger than some reference point declines as a power of x. To give a concrete
example, consider net worth in the US, which is distributed according to a power law
with exponent 2. At any given level of net worth (say, $1 million), people with half
that net worth are four times as frequent, and people with twice that net worth are
one-fourth as frequent. Importantly, the same relationship holds if we use $10,000 as
the reference point or if we use $100 million. For this reason, power-law distributions are
also called scale-free distributions.
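
The arithmetic behind this example can be checked with a short Python sketch. It assumes an idealized, unnormalized density proportional to x raised to the power of minus the exponent; the function names are hypothetical, and real net-worth data would follow such a law only approximately.

```python
def power_law_density(x, exponent=2.0):
    """Unnormalized power-law density, proportional to x**(-exponent)."""
    return x ** (-exponent)


def frequency_ratio(reference, factor, exponent=2.0):
    """How frequent values at factor * reference are, relative to reference."""
    return (power_law_density(factor * reference, exponent)
            / power_law_density(reference, exponent))


# Half the reference net worth is four times as frequent ...
half = frequency_ratio(1_000_000, 0.5)
# ... and twice the reference is one-fourth as frequent, at any scale.
double_small = frequency_ratio(10_000, 2)
double_large = frequency_ratio(100_000_000, 2)
```

The ratio depends only on the factor and not on the reference value, which is the scale-free property named in the text.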

Here, I will visualize the number of people living in different US counties according
to the 2010 US Census. This distribution has a very long tail to the right. Even though
most counties have relatively small numbers of inhabitants (the median is 25,857), a
few counties have extremely large numbers of inhabitants (e.g., Los Angeles County,
with 9,818,605 inhabitants). If we try to visualize the distribution of population
counts as either a density plot or an ECDF, we obtain figures that are essentially
useless (Figure 8-4).

Figure 8-4. Distribution of the number of inhabitants in US counties. (a) Density plot.
(b) Empirical cumulative distribution function. Data source: 2010 US Decennial Census.

The density plot (Figure 8-4a) shows a sharp peak right at 0, and virtually no details 
of the distribution are visible. Similarly, the ECDF (Figure 8-4b) shows a rapid rise 
near 0, and again no details of the distribution are visible. For this particular dataset, 
we can log-transform the data and visualize the distribution of the log-transformed 
values. This transformation works here because the distribution of population numbers
in counties is not actually a power law, but instead is a nearly perfect log-normal
distribution (see “Quantile-Quantile Plots” on page 78). Indeed, the density plot of 
the log-transformed values shows a nice bell curve and the corresponding ECDF 
shows a nice sigmoidal shape (Figure 8-5). 


Figure 8-5. Distribution of the logarithm of the number of inhabitants in US counties.
(a) Density plot. (b) Empirical cumulative distribution function. Both panels share the
x axis log10(number of inhabitants). Data source: 2010 US Decennial Census.

To see that this distribution is not a power law, we plot it as a descending ECDF with 
logarithmic x and y axes. In this visualization, a power law appears as a perfect 
straight line. For the population counts in counties, the right tail forms almost but not 
quite a straight line on the descending log-log ECDF plot (Figure 8-6). 
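
The coordinates of such a plot can be computed as in the following sketch. The function name and the four sample values are hypothetical; the code assumes strictly positive data (the logarithm is undefined otherwise) and only prepares the points, leaving the drawing to whatever plotting software is at hand.

```python
import math


def descending_ecdf_loglog(values):
    """Return (log10 x, log10 fraction of values >= x) per distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for index, x in enumerate(ordered):
        if index > 0 and ordered[index - 1] == x:
            continue  # tie: the first occurrence already counted all values >= x
        fraction = (n - index) / n
        points.append((math.log10(x), math.log10(fraction)))
    return points


# Hypothetical counts; a real dataset would contain thousands of values.
loglog_points = descending_ecdf_loglog([1, 10, 10, 100])
```

If the data followed a power law, these points would fall close to a straight line whose slope reflects the exponent of the tail.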

Figure 8-6. Relative frequency of counties with at least that many inhabitants versus the
number of county inhabitants. Data source: 2010 US Decennial Census.

As a second example, I will use the distribution of word frequencies for all words that 
appear in the novel Moby Dick. This distribution closely follows a power law. When it
is plotted as a descending ECDF with logarithmic axes, we see a nearly perfect straight
line (Figure 8-7).

Figure 8-7. Distribution of word counts in the novel Moby Dick. Shown is the relative
frequency of words that occur at least that many times in the novel versus the number of
times words are used. Data source: [Clauset, Shalizi, and Newman 2009].

Quantile-Quantile Plots

Quantile-quantile (q-q) plots are useful visualizations when we want to determine to 
what extent the observed data points do or do not follow a given distribution. Just like 
ECDFs, q-q plots are also based on ranking the data and visualizing the relationship 
between ranks and actual values. However, in q-q plots we don’t plot the ranks 
directly; rather, we use them to predict where a given data point would fall if the data 
were distributed according to a specified reference distribution. Most commonly, q-q 
plots are constructed using a normal distribution as the reference. To give a concrete 
example, assume the actual data values have a mean of 10 and a standard deviation of 
3. Then, assuming a normal distribution, we would expect a data point ranked at the 
50th percentile to lie at position 10 (the mean), a data point at the 84th percentile to 
lie at position 13 (one standard deviation above the mean), and a data point at the 
2.3rd percentile to lie at position 4 (two standard deviations below the mean). We can 
carry out this calculation for all points in the dataset and then plot the observed values
(i.e., values in the dataset) against the theoretical values (i.e., values expected
given each data point’s rank and the assumed reference distribution).
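
The calculation can be sketched with Python's standard `statistics.NormalDist`. Two choices here are assumptions rather than something the text prescribes: the reference distribution uses the sample mean and sample standard deviation, and each rank is converted to a percentile with the plotting position (rank - 0.5) / n. Other plotting positions are in common use, and the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Return (theoretical, observed) pairs for a normal q-q plot."""
    reference = NormalDist(mean(data), stdev(data))
    ordered = sorted(data)
    n = len(ordered)
    pairs = []
    for rank, observed in enumerate(ordered, start=1):
        # Assumed plotting position: midpoint of the rank's probability step.
        probability = (rank - 0.5) / n
        pairs.append((reference.inv_cdf(probability), observed))
    return pairs


# The worked example from the text: mean 10, standard deviation 3.
example_dist = NormalDist(mu=10, sigma=3)
at_median = example_dist.inv_cdf(0.5)      # the mean
one_sd_above = example_dist.inv_cdf(0.8413)  # close to 13
two_sd_below = example_dist.inv_cdf(0.0228)  # close to 4
```

Plotting the observed values against the theoretical ones, together with the line x = y, would give the q-q plot; the data need at least two distinct values so that the standard deviation is positive.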

When we perform this procedure for the student grades distribution from the beginning
of this chapter, we obtain Figure 8-8.



Figure 8-8. q-q plot of hypothetical student grades.

The solid line here is not a regression line but indicates the points where x equals y,
i.e., where the observed values equal the theoretical ones. To the extent that points fall
onto that line, the data follows the assumed distribution (here, normal). We see that
the student grades mostly follow a normal distribution, with a few deviations at the
bottom and at the top (a few students performed worse than expected on either end).
The deviations from the distribution at the top end are caused by the maximum point
value of 100 in the hypothetical exam; regardless of how good the best student is, they
can at most obtain 100 points.

We can also use a q-q plot to test my assertion from earlier in this chapter that the 
population counts in US counties follow a log-normal distribution. If these counts are 
log-normally distributed, then their log-transformed values are normally distributed 
and hence should fall right onto the x = y line. When making this plot, we see that the 
agreement between the observed and the theoretical values is exceptional 
(Figure 8-9). This demonstrates that the distribution of population counts among 
counties is indeed log-normal. 



Figure 8-9. q-q plot of the logarithm of the number of inhabitants in US counties. Data 
source: 2010 US Decennial Census.


CHAPTER 9


Visualizing Many Distributions at Once 


There are many scenarios in which we want to visualize multiple distributions at the
same time. For example, consider weather data. We may want to visualize how
temperature varies across different months while also showing the distribution of
observed temperatures within each month. This scenario requires showing a dozen
temperature distributions at once, one for each month. None of the visualizations
discussed in Chapters 7 or 8 work well in this case. Instead, viable approaches include
boxplots, violin plots, and ridgeline plots.

Whenever we are dealing with many distributions, it is helpful to think in terms of 
the response variable and one or more grouping variables. The response variable is the 
variable whose distributions we want to show. The grouping variables define subsets 
of the data with distinct distributions of the response variable. For example, for
temperature distributions across months, the response variable is the temperature and
the grouping variable is the month. All techniques discussed in this chapter draw the 
response variable along one axis and the grouping variable(s) along the other. In the 
following sections, I will first describe approaches that show the response variable 
along the vertical axis, and then I will describe approaches that show the response 
variable along the horizontal axis. In all cases discussed, we could flip the axes and 
arrive at an alternative and viable visualization. I am showing here the canonical 
forms of the various visualizations. 

Visualizing Distributions Along the Vertical Axis 

The simplest approach to showing many distributions at once is to show their mean
or median as points, with some indication of the variation around the mean or
median shown by error bars. Figure 9-1 demonstrates this approach for the distributions
of monthly temperatures in Lincoln, Nebraska, in 2016. I have labeled this figure
as “bad” because there are multiple problems with this approach. First, by
representing each distribution by only one point and two error bars, we are losing a
lot of information about the data. Second, it is not immediately obvious what the
points represent, even though most readers would likely guess that they represent
either the mean or the median. Third, it is definitely not obvious what the error bars
represent. Do they represent the standard deviation of the data, the standard error of
the mean, a 95% confidence interval, or something else altogether? There is no
commonly accepted standard. By reading the caption of Figure 9-1, we can see that
here they represent twice the standard deviation of the daily mean temperatures,
meant to indicate the range that contains approximately 95% of the data. However,
error bars are more commonly employed to visualize the standard error (or twice the
standard error for a 95% confidence interval), and it is easy for readers to confuse the
standard error with the standard deviation. The standard error quantifies how accurate
our estimate of the mean is, whereas the standard deviation estimates how much
spread there is in the data around the mean. It is possible for a dataset to have both a
very small standard error of the mean and a very large standard deviation. Fourth,
symmetric error bars are misleading if there is any skew in the data, which is the case
here and is almost always the case for real-world datasets.
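
The difference between the two quantities is easy to see numerically. The sketch below uses hypothetical values and a hypothetical helper name; it computes the sample standard deviation and derives the standard error of the mean as the standard deviation divided by the square root of the sample size.

```python
import math
import statistics


def spread_and_precision(values):
    """Return (standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)      # spread of the data around the mean
    se = sd / math.sqrt(len(values))   # uncertainty of the estimated mean
    return sd, se


# Hypothetical measurements, and the same measurements repeated 100 times.
small = [2, 4, 4, 4, 5, 5, 7, 9]
sd_small, se_small = spread_and_precision(small)
sd_large, se_large = spread_and_precision(small * 100)
```

With a hundred times as many observations the standard deviation stays close to 2, while the standard error shrinks by roughly a factor of ten, which is how a dataset can combine a small standard error with a large standard deviation.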

Figure 9-1. Mean daily temperatures in Lincoln, NE, in 2016. Points represent the average
daily mean temperatures for each month, averaged over all days of the month, and
error bars represent twice the standard deviation of the daily mean temperatures within
each month. This figure has been labeled as “bad” because error bars are conventionally
used to visualize the uncertainty of an estimate, not the variability in a population. Data
source: Weather Underground.

We can address all four shortcomings of Figure 9-1 by using a traditional and commonly
used method for visualizing distributions, the boxplot. A boxplot divides the
data into quartiles and visualizes them in a standardized manner (Figure 9-2).

Figure 9-2. Anatomy of a boxplot. Shown are a cloud of points (left) and the corresponding
boxplot (right). Labels in the figure, from top to bottom: outlier, maximum within upper
fence, third quartile, median, first quartile, minimum.


Only the y values of the points are visualized in the boxplot in Figure 9-2. The line in 
the middle of the boxplot represents the median, and the box encloses the middle 
50% of the data. The vertical lines extending upwards and downwards from the box 
are called whiskers. The top and bottom whiskers extend either to the maximum and 
minimum values of the data or to the maximum or minimum values that fall within 
1.5 times the height of the box, whichever yields the shorter whisker. The distances of
1.5 times the height of the box in either direction are called the upper and lower fences.
Individual data points that fall beyond the fences are referred to as outliers and
are usually shown as individual dots.
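
The quantities drawn in a boxplot can be computed as in the following sketch. Quartile definitions differ between software packages; the `"inclusive"` method of Python's `statistics.quantiles` is an assumption made here for illustration, as are the function name and the sample data.

```python
import statistics


def boxplot_stats(values):
    """Summarize data the way a boxplot draws it."""
    # Quartile conventions differ between tools; "inclusive" is an assumption.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_height = q3 - q1
    lower_fence = q1 - 1.5 * box_height
    upper_fence = q3 + 1.5 * box_height
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside),    # most extreme points within the fences
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values
                           if v < lower_fence or v > upper_fence),
    }


# Hypothetical data with one unusually large value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this hypothetical dataset the value 30 lies beyond the upper fence and is reported as an outlier, so the upper whisker ends at 9 rather than at the maximum.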

Boxplots are simple yet informative, and they work well when plotted next to each 
other to visualize many distributions at once. For the Lincoln temperature data, using 
boxplots leads to Figure 9-3. In that figure, we can now see that temperature is highly 
skewed in December (most days are moderately cold and a few are extremely cold) 
and not very skewed at all in some other months, such as in July.

Figure 9-3. Mean daily temperatures in Lincoln, NE, visualized as boxplots. Data source:
Weather Underground.


Boxplots were invented by the statistician John Tukey in the early 1970s, and they
quickly gained popularity because they were highly informative while being easy to
draw by hand, which is how most data visualizations were drawn at that time. However,
with modern computing and visualization capabilities, we are not limited to
what is easily drawn by hand. Therefore, more recently we see boxplots being
replaced by violin plots (Figure 9-4). Violins can be used whenever one would otherwise
use a boxplot, and they provide a much more nuanced picture of the data. In
particular, violin plots will accurately represent bimodal data whereas a boxplot will
not.

Only the y values of the points are visualized in the violin plot. The width of the violin
at a given y value represents the point density at that y value. Technically, a violin
plot is a density estimate (Chapter 7) rotated by 90 degrees and then mirrored. Violins
are therefore symmetric. Violins begin and end at the minimum and maximum
data values, respectively. The thickest part of the violin corresponds to the highest
point density in the dataset.
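
To make the construction concrete, here is a minimal sketch that evaluates a Gaussian kernel density estimate between the minimum and maximum of the data. The function name, the fixed bandwidth, and the grid size are assumptions for illustration; plotting software typically chooses the bandwidth automatically and may differ in other details.

```python
import math


def violin_half_widths(values, bandwidth, grid_size=50):
    """Gaussian kernel density evaluated between min and max of the data.

    Returns (y, density) pairs; the violin outline runs from -density to
    +density around the group's position on the x axis.
    """
    low, high = min(values), max(values)
    step = (high - low) / (grid_size - 1)
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    outline = []
    for i in range(grid_size):
        y = low + i * step
        density = sum(math.exp(-0.5 * ((y - v) / bandwidth) ** 2)
                      for v in values) / norm
        outline.append((y, density))
    return outline


# Hypothetical bimodal temperatures clustered around 35 and 50 degrees.
outline = violin_half_widths([35, 36, 34, 50, 51, 49], bandwidth=2.0,
                             grid_size=35)
```

For these hypothetical values the density is high near 35 and near 50 and very low in between, so the resulting violin would show two bulges, a feature that a boxplot of the same data could not reveal.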

Figure 9-4. Anatomy of a violin plot. Shown are a cloud of points (left) and the
corresponding violin plot (right).



Before using violins to visualize distributions, verify that you have 
sufficiently many data points in each group to justify showing the 
point densities as smooth lines. 


When we visualize the Lincoln temperature data with violins, we obtain Figure 9-5. 
We can now see that some months do have moderately bimodal data. For example, 
the month of November seems to have had two temperature clusters, one around 50 
degrees and one around 35 degrees Fahrenheit. 

Because violin plots are derived from density estimates, they have similar shortcomings.
In particular, they can generate the appearance that there is data where none
exists, or that the dataset is very dense when actually it is quite sparse. We can try to 
circumvent these issues by simply plotting all the individual data points directly, as 
dots (Figure 9-6). Such a figure is called a strip chart. Strip charts are fine in principle, 
as long as we make sure that we don’t plot too many points on top of each other. A 
simple solution to overplotting is to spread out the points somewhat along the x axis, 
by adding some random noise in the x dimension (Figure 9-7). This technique is 
called jittering. 
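
Jittering takes only a few lines of code. The sketch below adds uniform random noise to the x positions and leaves the y values untouched; the function name, the default width, and the fixed seed are assumptions chosen for illustration.

```python
import random


def jitter(x_positions, width=0.2, seed=0):
    """Shift each x position by uniform noise in [-width, +width]."""
    rng = random.Random(seed)  # fixed seed keeps the figure reproducible
    return [x + rng.uniform(-width, width) for x in x_positions]


# Hypothetical example: 100 points that all belong to group 1 on the x axis.
jittered = jitter([1] * 100)
```

The width should stay well below half the distance between neighboring groups so that the jittered points of one group do not spill into the next.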

Figure 9-5. Mean daily temperatures in Lincoln, NE, visualized as violin plots. Data
source: Weather Underground.

Figure 9-6. Mean daily temperatures in Lincoln, NE, visualized as strip charts. Each
point represents the mean temperature for one day. This figure is labeled as “bad”
because so many points are plotted on top of each other that it is not possible to ascertain
which temperatures were the most common in each month. Data source: Weather
Underground.

Figure 9-7. Mean daily temperatures in Lincoln, NE, visualized as strip charts. The
points have been jittered along the x axis to better show the density of points at each
temperature value. Data source: Weather Underground.



Whenever the dataset is too sparse to justify the violin visualization,
plotting the raw data as individual points will be possible.


Finally, we can combine the best of both worlds by spreading out the dots in proportion
to the point density at a given y coordinate. This method, called a sina plot
[Sidiropoulos et al. 2018],¹ can be thought of as a hybrid between a violin plot and jittered
points, and it shows each individual point while also visualizing the distributions. In
Figure 9-8, I have drawn the sina plots on top of the violins to highlight the relationship
between these two approaches.


¹ The name sina plot is meant to honor Sina Hadi Sohi, a student at the University of Copenhagen, Denmark,
who wrote the first version of the code that researchers at the university used to make such plots (Frederik O.
Bagger, personal communication).

Figure 9-8. Mean daily temperatures in Lincoln, NE, visualized as sina plots (a combination
of individual points and violins). The points have been jittered along the x axis in
proportion to the point density at the respective temperature. Here, the sina plots are
shown superimposed on violin plots. Data source: Weather Underground.


Visualizing Distributions Along the Horizontal Axis 

In Chapter 7, we visualized distributions along the horizontal axis using histograms 
and density plots. Here, we will expand on this idea by staggering the distribution 
plots in the vertical direction. The resulting visualization is called a ridgeline plot, 
because these plots look like mountain ridgelines. Ridgeline plots tend to work
particularly well if you want to show trends in distributions over time.

The standard ridgeline plot uses density estimates (Figure 9-9). It is quite closely 
related to the violin plot, but frequently evokes a more intuitive understanding of the 
data. For example, the two clusters of temperatures around 35 degrees and 50 degrees 
Fahrenheit in November are much more obvious in Figure 9-9 than in Figure 9-5. 

Because the x axis shows the response variable and the y axis shows the grouping 
variable, there is no separate axis for the density estimates in a ridgeline plot. Density 
estimates are shown alongside the grouping variable. This is no different from the 
violin plot, where densities are also shown alongside the grouping variable, without a 
separate, explicit scale. In both cases, the purpose of the plot is not to show specific 
density values but instead to allow for easy comparison of density shapes and relative 
heights across groups.

Figure 9-9. Temperatures in Lincoln, NE, in 2016, visualized as a ridgeline plot. For each
month, we show the distribution of daily mean temperatures measured in Fahrenheit.
Original figure concept: [Wehrwein 2017]. Data source: Weather Underground.

In principle, we can use histograms instead of density plots in a ridgeline visualization.
However, the resulting figures often don’t look very good (Figure 9-10). The
problems are similar to those of stacked or overlapping histograms (see “Visualizing
Multiple Distributions at the Same Time” on page 64). Because the vertical lines in
these ridgeline histograms always appear at the exact same x values, the bars from
different histograms align with each other in confusing ways. In my opinion, it is better
to not draw such overlapping histograms.

Figure 9-10. Temperatures in Lincoln, NE, in 2016, visualized as a ridgeline plot of
histograms. The individual histograms don’t separate well visually, and the overall figure is
quite busy and confusing. Data source: Weather Underground.


Ridgeline plots scale to very large numbers of distributions. For example, Figure 9-11 
shows the distributions of movie lengths from 1913 to 2005. This figure contains 
almost 100 distinct distributions and yet it is very easy to read. We can see that in the 
1920s, movies came in many different lengths, but since about 1960 movie length has 
standardized to approximately 90 minutes.

Figure 9-11. Evolution of movie lengths over time. Since the 1960s, the majority of all
movies have been approximately 90 minutes long. Data source: Internet Movie Database
(IMDB).


Ridgeline plots also work well if we want to compare two trends over time. This is a
scenario that arises commonly if we want to analyze the voting patterns of the members
of two different parties. We can make this comparison by staggering the distributions
vertically by time and drawing two differently colored distributions at each time
point, representing the two parties (Figure 9-12).

Figure 9-12. Voting patterns in the US House of Representatives have become increasingly
polarized. DW-NOMINATE scores are frequently used to compare voting patterns
of representatives between parties and over time. Here, score distributions are shown for
each Congress from 1963 to 2013 separately for Democrats and Republicans. Each Congress
is represented by its first year. Original figure concept: [McDonald 2017]. Data
source: Keith Poole.


CHAPTER 10


Visualizing Proportions 


We often want to show how some group, entity, or amount breaks down into individual
pieces that each represent a proportion of the whole. Common examples include
the proportions of men and women in a group of people, the percentages of people
voting for different political parties in an election, or the market shares of companies.
The archetypal such visualization is the pie chart, omnipresent in any business
presentation and much maligned among data scientists. As we will see, visualizing
proportions can be challenging, in particular when the whole is broken into many
different pieces or when we want to see changes in proportions over time or across
conditions. There is no single ideal visualization that always works. To illustrate this
issue, I discuss a few different scenarios that each call for a different type of
visualization.



Remember, you always need to pick the visualization that best fits 
your specific dataset and that highlights the key data features you 
want to show. 


A Case for Pie Charts 

From 1961 to 1983, the German parliament (called the Bundestag) was composed of 
members of three different parties, CDU/CSU, SPD, and FDP. During most of this 
time, CDU/CSU and SPD had approximately comparable numbers of seats, while 
FDP typically held only a small fraction of seats. For example, in the eighth
Bundestag, from 1976 to 1980, CDU/CSU held 243 seats, SPD 214, and FDP 39, for a
total of 496. Such parliamentary data is most commonly visualized as a pie chart
(Figure 10-1).

Figure 10-1. Party composition of the eighth German Bundestag, 1976-1980, visualized
as a pie chart. This visualization highlights that the ruling coalition of SPD and FDP had
a small majority over the opposition CDU/CSU. Data source: Wikipedia.


A pie chart breaks a circle into slices such that the area of each slice is proportional to
the fraction of the total it represents. The same procedure can be performed on a
rectangle, and the result is a stacked bar chart (Figure 10-2). Depending on whether we
slice the bar vertically or horizontally, we obtain vertically stacked bars (Figure 10-2a)
or horizontally stacked bars (Figure 10-2b).
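
To make the arithmetic behind these two chart types concrete, the following minimal Python sketch converts the seat counts quoted above into pie-wedge angles and stacked-bar segments. The function names pie_angles and stacked_segments are illustrative assumptions and do not belong to any plotting library.

```python
# Seat counts for the eighth Bundestag, as quoted in the text.
seats = {"CDU/CSU": 243, "SPD": 214, "FDP": 39}
total = sum(seats.values())

def pie_angles(counts):
    """Return (label, start_deg, end_deg) wedges that together cover 360 degrees."""
    count_total = sum(counts.values())
    wedges, start = [], 0.0
    for label, n in counts.items():
        end = start + 360.0 * n / count_total
        wedges.append((label, start, end))
        start = end
    return wedges

def stacked_segments(counts):
    """Return (label, lower, upper) segments of a single bar of height 1."""
    count_total = sum(counts.values())
    segments, lower = [], 0.0
    for label, n in counts.items():
        upper = lower + n / count_total
        segments.append((label, lower, upper))
        lower = upper
    return segments

wedges = pie_angles(seats)
segments = stacked_segments(seats)
for (label, a0, a1), (_, y0, y1) in zip(wedges, segments):
    print(f"{label:8s} wedge {a1 - a0:6.1f} deg   bar share {y1 - y0:.3f}")
```

Both functions divide each count by the same total, so the two charts encode identical proportions and differ only in geometry. The same counts also show that SPD and FDP together held 253 seats, ten more than the 243 seats of CDU/CSU.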


Figure 10-2. Party composition of the eighth German Bundestag, 1976-1980, visualized
as stacked bars. (a) Bars stacked vertically. (b) Bars stacked horizontally. It is not
immediately obvious that SPD and FDP jointly had more seats than CDU/CSU. Data source:
Wikipedia.


We can also take the bars from Figure 10-2a and place them side-by-side rather than 
stacking them on top of each other. This visualization makes it easier to perform a 
direct comparison of the three groups, though it obscures other aspects of the data 
(Figure 10-3). Most importantly, in a side-by-side bar plot the relationship of each bar 
to the total is not visually obvious. 

Figure 10-3. Party composition of the eighth German Bundestag, 1976-1980, visualized
as side-by-side bars. As in Figure 10-2, it is not immediately obvious that SPD and FDP
jointly had more seats than CDU/CSU. Data source: Wikipedia.


Many authors categorically reject pie charts and argue in favor of side-by-side or
stacked bars. Others defend the use of pie charts in some applications. My own opinion
is that none of these visualizations is consistently superior to the others.
Depending on the features of the dataset and the specific story you want to tell, you
may want to favor one or the other approach. In the case of the eighth German
Bundestag, I think that a pie chart is the best option. It highlights that the ruling coalition
of SPD and FDP jointly had a small majority over CDU/CSU (Figure 10-1). This fact
is not visually obvious in any of the other plots (Figures 10-2 and 10-3).

In general, pie charts work well when the goal is to emphasize simple fractions, such 
as one-half, one-third, or one-quarter. They also work well when we have very small 
datasets. A single pie chart, as in Figure 10-1, looks just fine, but a single column of 
stacked bars, as in Figure 10-2a, looks awkward. Stacked bars, on the other hand, can 
work for side-by-side comparisons of multiple conditions or in a time series, and 
side-by-side bars are preferred when we want to directly compare the individual
fractions to each other. A summary of the various pros and cons of pie charts, stacked
bars, and side-by-side bars is provided in Table 10-1.

Table 10-1. Pros and cons of common approaches to visualizing proportions: pie charts,
stacked bars, and side-by-side bars.

                                                             Pie chart   Stacked bars   Side-by-side bars
Clearly visualizes the data as proportions of a whole            ✓            ✓                ✗
Allows easy visual comparison of the relative proportions        ✗            ✗                ✓
Visually emphasizes simple fractions, such as 1/2, 1/3, 1/4      ✓            ✗                ✗
Looks visually appealing even for very small datasets            ✓            ✗                ✓
Works well when the whole is broken into many pieces             ✗            ✗                ✓
Works well for the visualization of many sets of                 ✗            ✓                ✗
  proportions or time series of proportions


A Case for Side-by-Side Bars 

I will now demonstrate a case where pie charts fail. This example is modeled after a
critique of pie charts originally posted on Wikipedia [Wikipedia 2007]. Consider the
hypothetical scenario of five companies, A, B, C, D, and E, that all have roughly
comparable market shares of approximately 20%. Our hypothetical dataset lists the market
share of each company for three consecutive years. When we visualize this dataset
with pie charts, it is difficult to see specific trends (Figure 10-4). It appears that the
market share of company A is growing and that of company E is shrinking, but
beyond this one observation we can’t tell what’s going on. In particular, it is unclear
how exactly the market shares of the different companies compare within each year.



Figure 10-4. Market share of five hypothetical companies, A-E, for the years 2015-2017,
visualized as pie charts. This visualization has two major problems: (i) a comparison of
relative market share within years is nearly impossible, and (ii) changes in market share
across years are difficult to see.

The picture becomes a little clearer when we switch to stacked bars (Figure 10-5).
Now the trends of a growing market share for company A and a shrinking market
share for company E are clearly visible. However, the relative market shares of the five
companies within each year are still hard to compare. And it is difficult to compare
the market shares of companies B, C, and D across years, because the bars are shifted
relative to each other across years. This is a general problem of stacked-bar plots, and
the main reason why I normally do not recommend this type of visualization.

Figure 10-5. Market share of five hypothetical companies for the years 2015-2017,
visualized as stacked bars. This visualization has two major problems: (i) a comparison
of relative market shares within years is difficult, and (ii) changes in market share across
years are difficult to see for the middle companies (B, C, and D) because the location of
the bars changes across years.
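
The shifting-baseline problem can be reproduced with a few lines of Python. The sketch below is only an illustration: the text gives no numbers for the five companies, so the market shares are invented, and the helper name baselines is an assumption.

```python
# Hypothetical market shares in percent. The text does not list the values,
# so these numbers are invented purely for illustration.
shares = {
    2015: {"A": 17, "B": 18, "C": 20, "D": 22, "E": 23},
    2016: {"A": 20, "B": 20, "C": 20, "D": 20, "E": 20},
    2017: {"A": 23, "B": 22, "C": 20, "D": 18, "E": 17},
}

def baselines(year_shares):
    """Lower edge of each segment when the values are stacked in the given order."""
    result, lower = {}, 0
    for company, value in year_shares.items():
        result[company] = lower
        lower += value
    return result

stacked = {year: baselines(year_shares) for year, year_shares in shares.items()}
for company in "ABCDE":
    print(company, [stacked[year][company] for year in sorted(stacked)])
```

In this invented dataset the share of company C is 20% in every year, yet its segment starts at a different height in each stack, which is what makes the inner segments hard to compare. In a side-by-side bar plot every bar starts at zero, so the problem does not arise.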

For this hypothetical dataset, side-by-side bars are the best choice (Figure 10-6). This 
visualization highlights that both companies A and B have increased their market 
share from 2015 to 2017 while both companies D and E have reduced theirs. It also 
shows that market shares increase sequentially from company A to E in 2015 and 
similarly decrease in 2017. 

Figure 10-6. Market share of five hypothetical companies for the years 2015-2017,
visualized as side-by-side bars.

A Case for Stacked Bars and Stacked Densities 

In the previous section, I wrote that I don’t normally recommend sequences of
stacked bars, because the locations of the internal bars shift along the sequence.
However, the problem of shifting internal bars disappears if there are only two bars in
each stack, and in those cases the resulting visualization can be quite clear. As an
example, consider the proportion of women in a country’s national parliament. We
will specifically look at the African country Rwanda, which as of 2016 tops the list of
countries with the highest proportion of female parliament members. Rwanda has
had a majority female parliament since 2008, and since 2013 nearly two-thirds of its
members of parliament have been female. To visualize how the proportion of women
in the Rwandan parliament has changed over time, we can draw a sequence of
stacked bar graphs (Figure 10-7). This figure provides an immediate visual
representation of the changing proportions over time. To help the reader see exactly when the
majority turned female, I have added a dashed horizontal line at 50%. Without this
line, it would be nearly impossible to determine whether from 2003 to 2007 the majority
was male or female. I have not added similar lines at 25% and 75%, to avoid making
the figure too cluttered.

Figure 10-7. Change in the gender composition of the Rwandan parliament over time,
1997 to 2016. Data source: Inter-Parliamentary Union (IPU).


If we want to visualize how proportions change in response to a continuous variable, 
we can switch from stacked bars to stacked densities. Stacked densities can be 
thought of as the limiting case of infinitely many, infinitely small stacked bars 
arranged side-by-side. The densities in stacked density plots are typically obtained 
from kernel density estimation, as described in Chapter 7, and I refer you to that 
chapter for a general discussion of the strengths and weaknesses of this method. 

To give an example where stacked densities may be appropriate, consider the health 
status of people as a function of age. Age can be considered a continuous variable, 
and visualizing the data in this way works reasonably well (Figure 10-8). Even though 
we have four health categories here, and I’m generally not a fan of stacking multiple 
conditions, as discussed previously, I think in this case the figure is acceptable. We 
can see that overall health declines as people age, and we can also see that despite this 
trend, over half of the population remains in good or excellent health until very old 
age. 

Figure 10-8. Health status by age. Data source: General Social Survey (GSS).

Nevertheless, this figure has a major limitation: by visualizing the proportions of the
four health conditions as percentages of the total, the figure obscures that there are
many more young people than old people in the dataset. Thus, even though the
percentage of people reporting to be in good health remains approximately unchanged
across ages spanning seven decades, the absolute number of people in good health
declines as the total number of people at a given age declines. I will present a potential
solution to this problem in the next section.
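
The difference between percentages and absolute numbers is easy to check numerically. The following sketch uses invented counts for two age bands (they are not taken from the GSS), and the helper name percentages is an assumption made for this example.

```python
# Hypothetical survey counts per age band, invented for illustration only.
counts = {
    "20-39": {"excellent": 300, "good": 450, "fair": 200, "poor": 50},
    "60-79": {"excellent": 60, "good": 180, "fair": 120, "poor": 40},
}

def percentages(band):
    """Convert the counts of one age band into percentages of that band."""
    band_total = sum(band.values())
    return {status: 100 * n / band_total for status, n in band.items()}

for age, band in counts.items():
    pct = percentages(band)
    print(f"{age}: {band['good']} people in good health, {pct['good']:.0f}% of the band")
```

With these made-up numbers, the share of people in good health is the same in both age bands even though the absolute number is much smaller in the older band. A stacked percentage display shows only the first of these two facts.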

Visualizing Proportions Separately as Parts of the Total 

Side-by-side bars have the problem that they don’t visualize the size of the individual
parts relative to the whole, and stacked bars have the problem that the different bars
cannot be compared easily because they have different baselines. We can resolve these
two issues by making a separate plot for each part and in each plot showing the
respective part relative to the whole. For the health dataset of Figure 10-8, this
procedure results in Figure 10-9. The overall age distribution in the dataset is shown as the
shaded gray areas, and the age distributions for each health status are shown in blue.
This figure highlights that in absolute terms, the number of people with excellent or
good health declines past ages 30-40, while the number of people with fair health
remains approximately constant across all ages.

Figure 10-9. Health status by age, shown as proportion of the total number of people in
the survey. The colored areas show the density estimates of the ages of people with the
respective health status and the gray areas show the overall age distribution. Data
source: GSS.


To provide a second example, let’s consider a different variable from the same survey:
marital status. Marital status changes much more drastically with age than does
health status, and a stacked densities plot of marital status versus age is not very
illuminating (Figure 10-10).

Figure 10-10. Marital status by age. To simplify the figure, I have removed a small number
of cases that report as separated. I have labeled this figure as “bad” because the
frequency of people who have never been married or are widowed changes so drastically
with age that the age distributions of married and divorced people are highly distorted
and difficult to interpret. Data source: GSS.

The same dataset visualized as partial densities is much clearer (Figure 10-11). In
particular, we see that the proportion of married people peaks around the late 30s, the
proportion of divorced people peaks around the early 40s, and the proportion of
widowed people peaks around the mid 70s.

Figure 10-11. Marital status by age, shown as proportion of the total number of people
in the survey. The colored areas show the density estimates of the ages of people with the
respective marital status, and the gray areas show the overall age distribution. Data
source: GSS.

However, one downside of Figure 10-11 is that this representation doesn’t make it
easy to determine relative proportions at any given age. For example, if we
wanted to know at what age more than 50% of all people surveyed are married, we
could not easily tell from Figure 10-11. To answer this question, we can use the same
type of display but show relative proportions instead of absolute counts along the y
axis (Figure 10-12). Now we see that married people are in the majority starting in
the late 20s, and widowed people are in the majority starting in the mid 70s.

Figure 10-12. Marital status by age, shown as proportion of the total number of people
in the survey. The areas colored in blue show the percent of people at the given age with
the respective status, and the areas colored in gray show the percent of people with all
other marital statuses. Data source: GSS.






CHAPTER 11 


Visualizing Nested Proportions 


In the preceding chapter, I discussed scenarios where a dataset is broken into pieces
defined by one categorical variable, such as political party, company, or health status.
It is not uncommon, however, that we want to drill down further and break down a
dataset by multiple categorical variables at once. For example, in the case of
parliamentary seats, we could be interested in the proportions of seats by party and by the
gender of the representatives. Similarly, in the case of people’s health status, we could
ask how health status further breaks down by marital status. I refer to these scenarios
as nested proportions, because each additional categorical variable that we add creates
a finer subdivision of the data nested within the previous proportions. There are
several suitable approaches to visualize such nested proportions, including mosaic plots,
treemaps, and parallel sets.

Nested Proportions Gone Wrong 

I will begin by demonstrating two flawed approaches to visualizing nested proportions.
While these approaches may seem nonsensical to any experienced data scientist,
I have seen them in the wild and therefore think they warrant discussion.
Throughout this chapter, I will work with a dataset of 106 bridges in Pittsburgh. This
dataset contains various pieces of information about the bridges, such as the material
from which they are constructed (steel, iron, or wood) and the year when they were
erected. Based on the year of erection, bridges are grouped into distinct categories,
such as crafts bridges that were erected before 1870 and modern bridges that were
erected after 1940.

Let’s assume we want to visualize both the fraction of bridges made from steel, iron, 
or wood and the fraction that are crafts or modern. We might be tempted to do so by 
drawing a combined pie chart (Figure 11-1). However, this visualization is not valid. 
All the slices in a pie chart must add up to 100%, and here the slices add up to 135%. 
We reach a total percentage in excess of 100% because we are double-counting
bridges. Every bridge in the dataset is made of steel, iron, or wood, so these three
slices of the pie already represent 100% of the bridges. Every crafts or modern bridge is
also a steel, iron, or wood bridge, and hence is counted twice in the pie chart.

Figure 11-1. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron) and by date of construction (crafts, before 1870, and modern, after 1940), shown
as a pie chart. Numbers represent the percentages of bridges of a given type among all
bridges. This figure is invalid, because the percentages add up to more than 100%. There
is overlap between construction material and date of construction. For example, all
modern bridges are made of steel, and the majority of crafts bridges are made of wood.
Data source: Yoram Reich and Steven J. Fenves, via the UCI Machine Learning
Repository [Dua and Karra Taniskidou 2017].
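
The double-counting can be demonstrated on a toy dataset. The records below are invented and far smaller than the actual set of 106 bridges, so the resulting total differs from the 135% quoted above; the variable names are likewise assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical (material, era) records, not the actual Pittsburgh data.
bridges = (
    [("wood", "crafts")] * 3
    + [("iron", "crafts")] * 1
    + [("iron", "emerging")] * 1
    + [("steel", "mature")] * 3
    + [("steel", "modern")] * 2
)
n = len(bridges)

material_counts = Counter(material for material, _ in bridges)
era_counts = Counter(era for _, era in bridges)
material_share = {m: 100 * c / n for m, c in material_counts.items()}
era_share = {e: 100 * c / n for e, c in era_counts.items()}

# Mix all materials with two of the eras, as the flawed pie chart does.
mixed = dict(material_share)
mixed["crafts"] = era_share["crafts"]
mixed["modern"] = era_share["modern"]

print("materials only:", sum(material_share.values()))
print("materials plus crafts and modern:", sum(mixed.values()))
```

The material shares alone already account for every record. Adding the crafts and modern shares on top counts those bridges a second time, which pushes the total past 100%.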

Double-counting is not necessarily a problem if we choose a visualization that does
not require the proportions to sum to 100%. As discussed in the preceding chapter,
side-by-side bars meet this criterion. We can show the various proportions of bridges
as bars in a single plot, and this plot is not technically wrong (Figure 11-2).
Nevertheless, I have labeled it as “bad” because it does not immediately show that there is
overlap among some of the categories shown. A casual observer might conclude from
Figure 11-2 that there are five separate categories of bridges, and that, for example,
modern bridges are neither made of steel nor of wood or iron.

Figure 11-2. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron) and by date of construction (crafts, before 1870, and modern, after 1940), shown
as a bar plot. Unlike Figure 11-1, this visualization is not technically wrong, since it
doesn’t imply that the bar heights need to add up to 100%. However, it also does not
clearly indicate the overlap among different groups, and therefore I have labeled it “bad.”
Data source: Yoram Reich and Steven J. Fenves.

Mosaic Plots and Treemaps 

Whenever we have categories that overlap, it is best to show explicitly how they relate
to each other. This can be done with a mosaic plot (Figure 11-3). At first glance, a
mosaic plot looks similar to a stacked bar plot (e.g., Figure 10-5). However, unlike in
a stacked bar plot, in a mosaic plot both the heights and the widths of individual
shaded areas vary. Note that in Figure 11-3, we see two additional construction eras,
emerging (from 1870 to 1889) and mature (1890 to 1939). In combination with crafts
and modern, these construction eras cover all bridges in the dataset, as do the three
building materials. This is a critical condition for a mosaic plot: every categorical
variable shown must cover all the observations in the dataset.

Figure 11-3. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron) and by era of construction (crafts, emerging, mature, modern), shown as a mosaic
plot. The widths of each rectangle are proportional to the number of bridges constructed
in that era, and the heights are proportional to the number of bridges constructed from
that material. Numbers represent the counts of bridges within each category. Data
source: Yoram Reich and Steven J. Fenves.

To draw a mosaic plot, we begin by placing one categorical variable along the x axis
(here, era of bridge construction) and subdividing the x axis by the relative
proportions that make up the categories. We then place the other categorical variable along
the y axis (here, building material) and, within each category along the x axis,
subdivide the y axis by the relative proportions that make up the categories of the y
variable. The result is a set of rectangles whose areas are proportional to the number of
cases representing each possible combination of the two categorical variables.
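
The construction just described can be sketched in a few lines of Python. This is a minimal illustration rather than a plotting routine: the counts are invented (they are not the actual bridge counts), the function name mosaic is an assumption, and the sketch assumes that every x category contains at least one case.

```python
# Hypothetical counts[era][material], not the actual bridge counts.
counts = {
    "crafts":   {"wood": 6, "iron": 2, "steel": 0},
    "emerging": {"wood": 2, "iron": 2, "steel": 4},
    "mature":   {"wood": 1, "iron": 1, "steel": 10},
    "modern":   {"wood": 0, "iron": 0, "steel": 4},
}

def mosaic(table):
    """Return {(x_cat, y_cat): (x0, y0, width, height)} inside the unit square."""
    grand_total = sum(sum(column.values()) for column in table.values())
    rects, x0 = {}, 0.0
    for x_cat, column in table.items():
        column_total = sum(column.values())
        width = column_total / grand_total    # share of this x category overall
        y0 = 0.0
        for y_cat, n in column.items():
            height = n / column_total         # share of y_cat within this column
            rects[(x_cat, y_cat)] = (x0, y0, width, height)
            y0 += height
        x0 += width
    return rects

rects = mosaic(counts)
for key, (x0, y0, width, height) in rects.items():
    print(key, f"area = {width * height:.4f}")
```

Because the width is the column total divided by the grand total and the height is the cell count divided by the column total, the area of each rectangle equals the cell count divided by the grand total. Combinations with a count of zero collapse to rectangles of zero height.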

The bridges dataset can also be visualized in a related but distinct format called a
treemap. In a treemap, just as is the case in a mosaic plot, we take an enclosing rectangle
and subdivide it into smaller rectangles whose areas represent the proportions.
However, the method of placing the smaller rectangles into the larger one is different
compared to the mosaic plot. In a treemap, we recursively nest rectangles inside each
other. For example, in the case of the Pittsburgh bridges, we can first subdivide the
total area into three parts representing the three building materials, wood, iron, and
steel. Then we can subdivide each of those areas further to represent the construction
eras represented for each building material (Figure 11-4). In principle we could keep
going, nesting ever more smaller subdivisions inside each other, though relatively
quickly the result would become unwieldy or confusing.

Figure 11-4. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron) and by era of construction (crafts, emerging, mature, modern), shown as a treemap.
The area of each rectangle is proportional to the number of bridges of that type.
Data source: Yoram Reich and Steven J. Fenves.
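
The recursive nesting can be illustrated with the simplest treemap layout, often called slice-and-dice, which alternates the cutting direction at each level. The text does not say which layout algorithm was used for Figure 11-4, so this choice is an assumption, as are the function names and the invented counts.

```python
# Hypothetical nested counts (material -> era -> number of bridges).
tree = {
    "wood":  {"crafts": 6, "emerging": 2, "mature": 1},
    "iron":  {"crafts": 2, "emerging": 2, "mature": 1},
    "steel": {"emerging": 4, "mature": 10, "modern": 4},
}

def size(node):
    """Total size of a leaf (a number) or of all leaves below a dict node."""
    if isinstance(node, dict):
        return sum(size(child) for child in node.values())
    return node

def slice_and_dice(node, x, y, w, h, vertical=True, path=(), out=None):
    """Recursively split the rectangle (x, y, w, h) among the children of node."""
    if out is None:
        out = {}
    if not isinstance(node, dict):
        out[path] = (x, y, w, h)              # leaf: record its rectangle
        return out
    total = size(node)
    offset = 0.0
    for name, child in node.items():
        frac = size(child) / total
        if vertical:                          # children placed side by side
            slice_and_dice(child, x + offset * w, y, frac * w, h,
                           False, path + (name,), out)
        else:                                 # children placed on top of each other
            slice_and_dice(child, x, y + offset * h, w, frac * h,
                           True, path + (name,), out)
        offset += frac
    return out

rects = slice_and_dice(tree, 0.0, 0.0, 1.0, 1.0)
for path, (x, y, w, h) in rects.items():
    print(path, f"area = {w * h:.4f}")
```

Unlike the mosaic construction, nothing here requires the same eras to appear under every material; each branch is subdivided independently. Slice-and-dice tends to produce long, thin rectangles, which is one reason practical treemap tools often use other layouts.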

While mosaic plots and treemaps are closely related, they have different points of
emphasis and different application areas. Here, the mosaic plot (Figure 11-3)
emphasizes the temporal evolution in building material use from the crafts era to the
modern era, whereas the treemap (Figure 11-4) emphasizes the total number of steel,
iron, and wood bridges.

More generally, mosaic plots assume that all of the proportions shown can be
identified via combinations of two or more orthogonal categorical variables. For example,
in Figure 11-3, every bridge can be described by a choice of building material (wood,
iron, steel) and a choice of time period (crafts, emerging, mature, modern).
Moreover, in principle every combination of these two variables is possible, even though in
practice this need not be the case. (Here, there are no steel crafts bridges and no wood
or iron modern bridges.) By contrast, such a requirement does not exist for treemaps.
In fact, treemaps tend to work well when the proportions cannot meaningfully be
described by combining multiple categorical variables. For example, we can separate
the US into four regions (West, Northeast, Midwest, and South) and each region into
distinct states, but the states in one region have no relationship to the states in
another region (Figure 11-5).

Figure 11-5. States in the US visualized as a treemap. Each rectangle represents one
state, and the area of each rectangle is proportional to the state’s land surface area. The
states are grouped into four regions, West, Northeast, Midwest, and South. The coloring
is proportional to the number of inhabitants for each state, with darker colors
representing larger numbers of inhabitants. Data source: 2010 US Decennial Census.

Both mosaic plots and treemaps are commonly used and can be illuminating, but
they have similar limitations to stacked bars (Table 10-1); namely, a direct
comparison among conditions can be difficult, because different rectangles do not necessarily
share baselines that enable visual comparison. In mosaic plots or treemaps, this
problem is exacerbated by the fact that the shapes of the different rectangles can vary. For
example, there are the same number of iron bridges (three) among the emerging and
mature bridges, but this is difficult to discern in the mosaic plot (Figure 11-3) because
the two rectangles representing these two groups of three bridges have entirely
different shapes. There isn’t necessarily a solution to this problem; visualizing nested
proportions can be tricky. Whenever possible, I recommend showing the actual counts
or percentages on the plot, so readers can verify that their intuitive interpretation of
the shaded areas is correct.

Nested Pies

At the beginning of this chapter, I visualized the bridges dataset with a flawed pie
chart (Figure 11-1), and I then argued that a mosaic plot or a treemap is more
appropriate. However, both of these latter plot types are closely related to pie charts, since
they all use area to represent data values. The primary difference is the type of
coordinate system: polar in the case of a pie chart versus Cartesian in the case of a mosaic
plot or treemap. This close relationship between these different plots raises the
question of whether some variant of a pie chart can be used to visualize this dataset.

There are two possibilities. First, we can draw a pie chart composed of an inner and 
an outer circle (Figure 11-6). The inner circle shows the breakdown of the data by 
one variable (here, building material), and the outer circle shows the breakdown of 
each slice of the inner circle by the second variable (here, era of bridge construction). 
This visualization is reasonable, but I have my reservations, and therefore I have 
labeled it “ugly.” Most importantly, the two separate circles obscure the fact that each 
bridge in the dataset has both a building material and an era of construction. In 
effect, in Figure 11-6, we are still double-counting each bridge. If we add up all the 
numbers shown in the two circles we obtain 212, which is twice the number of 
bridges in the dataset. 

Alternatively, we can first slice the pie into pieces representing the proportions 
according to one variable (e.g., material) and then subdivide these slices further 
according to the other variable (construction era) (Figure 11-7). In this way, in effect 
we are making a normal pie chart with a large number of small pie slices. However, 
we can then use coloring to indicate the nested nature of the pie. In Figure 11-7, 
green colors represent wood bridges, orange colors represent iron bridges, and blue 
colors represent steel bridges. The darkness of each color represents the construction 
era, with darker colors corresponding to more recently constructed bridges. By using 
a nested color scale in this way, we can visualize the breakdown of the data both by 
the primary variable (construction material) and by the secondary variable
(construction era).

Figure 11-6. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron; inner circle) and by era of construction (crafts, emerging, mature, modern; outer
circle). Numbers represent the counts of bridges within each category. Data source:
Yoram Reich and Steven J. Fenves.

The pie chart of Figure 11-7 represents a reasonable visualization of the bridges
dataset, but in a direct comparison to the equivalent treemap (Figure 11-4) I think the
treemap is preferable, for two reasons. First, the rectangular shape of the treemap
allows it to make better use of the available space. Figures 11-4 and 11-7 are of exactly
equal size, but in Figure 11-7 much of the figure is wasted as whitespace. Figure 11-4,
the treemap, has virtually no superfluous whitespace. This matters because it enables
me to place the labels inside the shaded areas in the treemap. Inside labels always
create a stronger visual unit with the data than outside labels and hence are preferred.
Second, some of the pie slices in Figure 11-7 are very thin and thus hard to see. By
contrast, every rectangle in Figure 11-4 is of a reasonable size.

Figure 11-7. Breakdown of bridges in Pittsburgh by construction material (steel, wood,
iron) and by era of construction (crafts, emerging, mature, modern). Numbers represent
the counts of bridges within each category. Data source: Yoram Reich and Steven J.
Fenves.

Parallel Sets 

When we want to visualize proportions described by more than two categorical
variables, mosaic plots, treemaps, and pie charts all can quickly become unwieldy. A
viable alternative in this case can be a parallel sets plot. In a parallel sets plot, we show
how the total dataset breaks down by each individual categorical variable, and then
we draw shaded bands that show how the subgroups relate to each other. See
Figure 11-8 for an example. In this figure, I have broken down the bridges dataset by
construction material (iron, steel, wood), the length of each bridge (long, medium,
short), the era during which each bridge was constructed (crafts, emerging, mature,
modern), and the river each bridge spans (Allegheny, Monongahela, Ohio). The
bands that connect the parallel sets are colored by construction material. This shows,
for example, that wood bridges are mostly of medium length (with a few short
bridges), were primarily erected during the crafts period (with a few bridges of
medium length erected during the emerging and mature periods), and span primarily
the Allegheny river (with a few crafts bridges spanning the Monongahela river). By
contrast, iron bridges are all of medium length, were primarily erected during the
crafts period, and span the Allegheny and Monongahela rivers in approximately
equal proportions.

Figure 11-8. Breakdown of bridges in Pittsburgh by construction material, length, era of
construction, and the river they span, shown as a parallel sets plot. The coloring of the
bands highlights the construction material of the different bridges. Data source: Yoram
Reich and Steven J. Fenves.
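
The bands in a parallel sets plot are essentially counts of records that share a category on two neighboring axes, split further by the variable that defines the coloring. The sketch below computes these counts for a handful of invented records (not the actual bridge data); the function name bands and its parameters are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records (material, length, era, river), invented for illustration.
records = [
    ("wood", "medium", "crafts", "Allegheny"),
    ("wood", "short", "crafts", "Monongahela"),
    ("iron", "medium", "crafts", "Allegheny"),
    ("iron", "medium", "emerging", "Monongahela"),
    ("steel", "long", "mature", "Ohio"),
    ("steel", "medium", "mature", "Allegheny"),
    ("steel", "long", "modern", "Monongahela"),
]
axes = ["material", "length", "era", "river"]

def bands(rows, left, right, color=0):
    """Count rows per (color category, left-axis category, right-axis category)."""
    return Counter((row[color], row[left], row[right]) for row in rows)

for i in range(len(axes) - 1):
    print(axes[i], "->", axes[i + 1])
    for (c, a, b), n in sorted(bands(records, i, i + 1).items()):
        print(f"  {c:5s} {a:11s} -> {b:11s} width {n}")
```

Each count corresponds to the thickness of one band between two neighboring axes. Passing a different index as color, for example the index of the river column, regroups the same records under a different coloring.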


The same visualization looks quite different if we color by a different criterion, such
as by river (Figure 11-9). This figure is visually busy, with many crisscrossing bands,
but we do see that nearly any bridge of any type can be found to span each river.

Figure 11-9. Breakdown of bridges in Pittsburgh by construction material, length, era of
construction, and the river they span. This figure is similar to Figure 11-8, but now the
coloring of the bands highlights the river spanned by the different bridges. This figure is
labeled “ugly” because the arrangement of the colored bands in the middle of the figure is
very busy, and also because the bands need to be read from right to left. Data source:
Yoram Reich and Steven J. Fenves.


I have labeled Figure 11-9 as “ugly” because I think it is overly complex and confusing.
First, since we are used to reading from left to right I think the sets that define
the coloring should appear all the way to the left, not on the right. This will make it
easier to see where the coloring originates and how it flows through the dataset.
Second, it is a good idea to change the order of the sets such that the number of
crisscrossing bands is minimized. Following these principles, I arrive at Figure 11-10,
which I consider preferable to Figure 11-9.

Figure 11-10. Breakdown of bridges in Pittsburgh by river, era of construction, length,
and construction material. This figure differs from Figure 11-9 only in the order of the
parallel sets. The modified order results in a figure that is easier to read and less busy.
Data source: Yoram Reich and Steven J. Fenves.


CHAPTER 12


Visualizing Associations Among Two or 
More Quantitative Variables 


Many datasets contain two or more quantitative variables, and we may be interested 
in how these variables relate to each other. For example, we may have a dataset of 
quantitative measurements of different animals, such as the animals’ height, weight, 
length, and daily energy demands. To plot the relationship of just two such variables, 
such as the height and weight, we will normally use a scatterplot. If we want to show 
more than two variables at once, we may opt for a bubble chart, a scatterplot matrix, 
or a correlogram. Finally, for very high-dimensional datasets, it may be useful to perform
dimension reduction, for example in the form of principal components analysis.

Scatterplots 

I will demonstrate the basic scatterplot and several variations thereof using a dataset 
of measurements performed on 123 blue jay birds. The dataset contains information 
such as the head length (measured from the tip of the bill to the back of the head), the 
skull size (head length minus bill length), and the body mass of each bird. We expect 
that there are relationships between these variables. For example, birds with longer 
bills would be expected to have larger skull sizes, and birds with higher body mass 
should have larger bills and skulls than birds with lower body mass. 

To explore these relationships, I begin with a plot of head length against body mass 
(Figure 12-1). In this plot, head length is shown along the y axis and body mass along 
the x axis, and each bird is represented by one dot. (Note the terminology: we say that 
we plot the variable shown along the y axis against the variable shown along the x 
axis.) The dots form a dispersed cloud (hence the term scatterplot), yet undoubtedly 
there is a trend for birds with higher body mass to have longer heads. The bird with
the longest head falls close to the maximum body mass observed, and the bird with
the shortest head falls close to the minimum body mass observed.

Figure 12-1. Head length (measured from the tip of the bill to the back of the head, in
mm) versus body mass (in grams), for 123 blue jays. Each dot corresponds to one bird.
There is a moderate tendency for heavier birds to have longer heads. Data source: Keith
Tarvin, Oberlin College.


The blue jay dataset contains both male and female birds, and we may want to know 
whether the overall relationship between head length and body mass holds up separately
for each sex. To address this question, we can color the points in the scatterplot
by the sex of the bird (Figure 12-2). This figure reveals that the overall trend in head 
length and body mass is at least in part driven by the sex of the birds. At the same 
body mass, females tend to have shorter heads than males. At the same time, females 
tend to be lighter than males on average. 

Because the head length is defined as the distance from the tip of the bill to the back 
of the head, a larger head length could imply a longer bill, a larger skull, or both. We 
can disentangle bill length and skull size by looking at another variable in the dataset, 
the skull size, which is similar to the head length but excludes the bill. As we are 
already using the x position for body mass, the y position for head length, and the dot 
color for bird sex, we need another aesthetic to which we can map skull size. One 
option is to use the size of the dots, resulting in a visualization called a bubble chart 
(Figure 12-3). 

Figure 12-2. Head length versus body mass for 123 blue jays. The birds’ sex is indicated
by color. At the same body mass, male birds tend to have longer heads (and specifically,
longer bills) than female birds. Data source: Keith Tarvin, Oberlin College.

Figure 12-3. Head length versus body mass for 123 blue jays. The birds’ sex is indicated
by color and the birds’ skull size by symbol size. Head length measurements include the
length of the bill while skull size measurements do not. Head length and skull size tend
to be correlated, but there are some birds with unusually long or short bills given their
skull size. Data source: Keith Tarvin, Oberlin College.

Bubble charts have the disadvantage that they show the same types of variables
(quantitative variables) with two different types of scales, position and size. This
makes it difficult to visually ascertain the strengths of associations between the various
variables. Moreover, differences between data values encoded as bubble size are
harder to perceive than differences between data values encoded as position. Because
even the largest bubbles need to be somewhat small compared to the total figure size,
the size differences between the largest and the smallest bubbles are necessarily small.
Consequently, smaller differences in data values will correspond to very small size
differences that can be virtually impossible to see. In Figure 12-3, I used a size mapping
that visually amplifies the difference between the smallest skulls (around 28
mm) and the largest skulls (around 34 mm), and yet it is difficult to determine what
the relationship is between skull size and either body mass or head length.
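
To make the idea of a size mapping concrete, here is a minimal Python sketch. It is not the mapping used for Figure 12-3; the function name `bubble_radius`, the area limits, and the sample skull sizes are all assumptions chosen for illustration. The sketch maps data values linearly onto bubble area and then converts area to radius.

```python
import math

def bubble_radius(value, vmin, vmax, area_min=4.0, area_max=100.0):
    """Map a data value linearly onto bubble area and return the radius.

    vmin and vmax are the data limits that map to area_min and area_max.
    The area limits are arbitrary drawing units (an assumption).
    """
    fraction = (value - vmin) / (vmax - vmin)
    area = area_min + fraction * (area_max - area_min)
    return math.sqrt(area / math.pi)

# Hypothetical skull sizes in mm, spanning the range mentioned in the text.
skull_sizes = [28.0, 30.0, 31.0, 32.0, 34.0]
radii = [bubble_radius(s, vmin=28.0, vmax=34.0) for s in skull_sizes]

for size, radius in zip(skull_sizes, radii):
    print(f"skull size {size:4.1f} mm -> radius {radius:.2f}")
```

Setting the data limits to the observed minimum and maximum devotes the whole area range to a span of only 6 mm, which is one way to amplify differences. Even so, neighboring values yield radii that differ only modestly, which illustrates why size is a weak channel for quantitative comparisons.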

As an alternative to a bubble chart, it may be preferable to show an all-against-all 
matrix of scatterplots, where each individual plot shows two data dimensions 
(Figure 12-4). This figure shows clearly that the relationship between skull size and 
body mass is comparable for female and male birds, except that the female birds tend 
to be somewhat smaller. However, the same is not true for the relationship between 
head length and body mass. There is a clear separation by sex. Male birds tend to 
have longer bills than female birds, all else being equal. 

Figure 12-4. All-against-all scatterplot matrix of head length, body mass, and skull size,
for 123 blue jays. This figure shows the exact same data as Figure 12-3. Because we are
better at judging position than symbol size, correlations between skull size and the other
two variables are easier to perceive in the pairwise scatterplots than in Figure 12-3. Data
source: Keith Tarvin, Oberlin College.


Correlograms 

When we have more than three to four quantitative variables, all-against-all scatterplot
matrices quickly become unwieldy. In this case, it is more useful to quantify the
amount of association between pairs of variables and visualize these quantities rather
than the raw data. One common way to do this is to calculate correlation coefficients.
The correlation coefficient r is a number between -1 and 1 that measures to what
extent two variables covary. A value of r = 0 means there is no association whatsoever,
and a value of either 1 or -1 indicates a perfect association. The sign of the correlation
coefficient indicates whether the variables are correlated (larger values in one
variable coincide with larger values in the other) or anticorrelated (larger values in
one variable coincide with smaller values in the other). To provide visual examples of
what different correlation strengths look like, in Figure 12-5 I show randomly
generated sets of points that differ widely in the degree to which the x and y values
are correlated.



Figure 12-5. Examples of correlations of different magnitude and direction, with associated
correlation coefficient r. In both rows, from left to right correlations go from weak to
strong. In the top row the correlations are positive (larger values for one quantity are
associated with larger values for the other) and in the bottom row they are negative
(larger values for one quantity are associated with smaller values for the other). In all six
panels, the sets of x and y values are identical, but the pairings between individual x and
y values have been reshuffled to generate the specified correlation coefficients.

The correlation coefficient is defined as:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are two sets of observations and $\bar{x}$ and $\bar{y}$ are the
corresponding sample means. We can make a number of observations from this formula. First,
the formula is symmetric in $x_i$ and $y_i$, so the correlation of x with y is the same as
the correlation of y with x. Second, the individual values $x_i$ and $y_i$ only enter the
formula in the context of differences from the respective sample mean, so if we shift an
entire dataset by a constant amount (for example, if we replace $x_i$ with $x_i' = x_i + C$
for some constant C), the correlation coefficient remains unchanged. Third, the correlation
coefficient also remains unchanged if we rescale the data by a positive constant (e.g.,
$x_i' = C x_i$ with $C > 0$), since the constant C will appear both in the numerator and
the denominator of the formula and hence can be canceled.
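
These properties are easy to check numerically. The following Python sketch implements the correlation coefficient from the sums of deviations and tests symmetry, shifting, and rescaling. The function name `pearson_r` and the data values are made up for illustration. Note that rescaling by a negative constant flips the sign of r, because the denominator only picks up the absolute value of the constant.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator

# Made-up example values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                          # symmetry in x and y
r_shifted = pearson_r([xi + 100.0 for xi in x], y)   # shift by a constant
r_scaled = pearson_r([3.0 * xi for xi in x], y)      # rescale by C > 0
r_flipped = pearson_r([-3.0 * xi for xi in x], y)    # rescale by C < 0

print(r, r_swapped, r_shifted, r_scaled, r_flipped)
```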

Visualizations of correlation coefficients are called correlograms. To illustrate the use
of a correlogram, we will consider a dataset of over 200 glass fragments obtained during
forensic work. For each glass fragment, we have measurements about its composition,
expressed as the percent in weight of various mineral oxides. There are seven
different oxides for which we have measurements, yielding a total of 6 + 5 + 4 + 3 +
2 + 1 = 21 pairwise correlations. We can display these 21 correlations at once as a
matrix of colored tiles, where each tile represents one correlation coefficient
(Figure 12-6). This correlogram allows us to quickly grasp trends in the data, such as
that magnesium is negatively correlated with nearly all other oxides, and that aluminum
and barium have a strong positive correlation.
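
The numbers behind a correlogram are one correlation coefficient per unordered pair of variables. The sketch below enumerates the pairs with `itertools.combinations` and computes each coefficient. The composition values are invented for illustration and are not the forensic glass data; only the seven variable names follow the text.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Correlation coefficient of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical percent-by-weight values for six fragments (made up).
samples = {
    "Mg": [4.5, 3.6, 3.5, 0.0, 2.8, 0.0],
    "Ca": [8.8, 7.8, 7.8, 9.4, 8.4, 8.7],
    "Fe": [0.00, 0.00, 0.10, 0.00, 0.20, 0.05],
    "K":  [0.06, 0.48, 0.39, 0.10, 0.60, 0.00],
    "Na": [13.6, 13.9, 13.5, 14.9, 13.0, 14.4],
    "Al": [1.1, 1.4, 1.5, 2.0, 1.3, 2.7],
    "Ba": [0.0, 0.0, 0.0, 0.6, 0.0, 1.6],
}

# Every unordered pair of variables appears exactly once.
pairs = list(combinations(samples, 2))
correlations = {(a, b): pearson_r(samples[a], samples[b]) for a, b in pairs}

print(len(pairs), "pairwise correlations")
for (a, b), value in correlations.items():
    print(f"{a:>2} vs {b:<2}: {value:+.2f}")
```

With n variables there are n(n - 1)/2 such pairs, which gives 21 for n = 7, in agreement with the sum 6 + 5 + 4 + 3 + 2 + 1.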



Figure 12-6. Correlations in mineral content for 214 samples of glass fragments obtained 
during forensic work. The dataset contains seven variables measuring the amounts of 
magnesium (Mg), calcium (Ca), iron (Fe), potassium (K), sodium (Na), aluminum (Al), 
and barium (Ba) found in each glass fragment. The colored tiles represent the correlations
between pairs of these variables. Data source: B. German.

One weakness of the correlogram of Figure 12-6 is that low correlations (i.e., correlations
with absolute value near zero) are not as visually suppressed as they should be.
For example, magnesium (Mg) and potassium (K) are not at all correlated, but
Figure 12-6 doesn’t immediately show this. To overcome this limitation, we can
display the correlations as colored circles and scale the circle size with the absolute
value of the correlation coefficient (Figure 12-7). In this way, low correlations are
suppressed and high correlations stand out better.





Figure 12-7. Correlations in mineral content for forensic glass samples. The color scale is 
identical to Figure 12-6. However, now the magnitude of each correlation is also encoded 
in the size of the colored circles. This choice visually deemphasizes cases with correlations 
near zero. Data source: B. German. 

All correlograms have one important drawback: they are fairly abstract. While they 
show us important patterns in the data, they also hide the underlying data points and 
may cause us to draw incorrect conclusions. It is always better to visualize the raw 
data rather than abstract derived quantities that have been calculated from it. Fortunately,
we can frequently find a middle ground between showing important patterns
and showing the raw data by applying techniques of dimension reduction. 

Dimension Reduction 

Dimension reduction relies on the key insight that most high-dimensional datasets 
consist of multiple correlated variables that convey overlapping information. Such 
datasets can be reduced to a smaller number of key dimensions without loss of much 
critical information. As a simple, intuitive example, consider a dataset of multiple 
physical traits of people, including quantities such as each person’s height and weight,
the lengths of their arms and legs, the circumferences of their waist, hips, and chest,
etc. We can understand intuitively that all these quantities will relate first and foremost
to the overall size of each person. All else being equal, a larger person will be
taller, weigh more, have longer arms and legs, and have larger waist, hip, and chest
circumferences. The next important dimension is going to be the person’s sex. Male 
and female measurements are substantially different for persons of comparable size. 
For example, a woman will tend to have higher hip circumference than a man, all else 
being equal. 

There are many techniques for dimension reduction. I will discuss only one technique
here, the most widely used one, called principal components analysis (PCA). PCA
introduces a new set of variables, called principal components (PCs), by linear combination
of the original variables in the data, standardized to zero mean and unit variance
(see Figure 12-8 for a toy example in two dimensions). The PCs are chosen
such that they are uncorrelated, and they are ordered such that the first component
captures the largest possible amount of variation in the data and subsequent components
capture increasingly less. Usually, key features in the data can be seen from only
the first two or three PCs.

Figure 12-8. Example principal components analysis in two dimensions. (a) The original
data. As example data, I am using the head length and skull size measurements from the
blue jays dataset. Female and male birds are distinguished by color, but this distinction
has no effect on the PCA. (b) As the first step in PCA, we scale the original data values
to zero mean and unit variance. We then define new variables (the principal components)
along the directions of maximum variation in the data. (c) Finally, we project the
data into the new coordinates. Mathematically, this projection is equivalent to a rotation
of the data points around the origin. In the 2D example shown here, the data points are
rotated clockwise by 45 degrees. Data source: Keith Tarvin, Oberlin College.
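
The two-dimensional case can be worked through by hand. The Python sketch below standardizes two variables and rotates them clockwise by 45 degrees, then checks that the resulting components are uncorrelated. The measurement values are invented and are not the blue jay data, and the helper names are assumptions. The fixed angle works only because there are exactly two standardized, positively correlated variables, for which the direction of maximum variance is the diagonal; a general PCA would obtain the directions from an eigendecomposition instead.

```python
import math

def standardize(values):
    """Scale values to zero mean and unit variance (population variance)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def covariance(us, vs):
    """Population covariance of two equally long sequences."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / n

# Hypothetical head length and skull size values in mm (made up).
head = [54.2, 55.1, 56.0, 56.8, 57.5, 58.3]
skull = [30.1, 30.9, 30.6, 31.4, 31.2, 32.0]

zx = standardize(head)
zy = standardize(skull)
r = covariance(zx, zy)  # for standardized data this is the correlation

# Rotate every point clockwise by 45 degrees around the origin.
c = math.cos(math.pi / 4)
s = math.sin(math.pi / 4)
pc1 = [c * x + s * y for x, y in zip(zx, zy)]
pc2 = [-s * x + c * y for x, y in zip(zx, zy)]

print("variance along PC 1:", covariance(pc1, pc1))  # equals 1 + r
print("variance along PC 2:", covariance(pc2, pc2))  # equals 1 - r
print("covariance of PCs:  ", covariance(pc1, pc2))  # zero up to rounding
```

The total variance of 2 is preserved by the rotation but is redistributed, so that PC 1 carries the share 1 + r and PC 2 the remainder 1 - r.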


When we perform PCA, we are generally interested in two pieces of information: the
composition of the PCs and the locations of the individual data points in the principal
components space. Let’s look at these two pieces in a PCA of the forensic glass
dataset.

First, we look at the component composition (Figure 12-9). Here, we only consider
the first two components, PC 1 and PC 2. Because the PCs are linear combinations of
the original variables (after standardization), we can represent the original variables
as arrows indicating to what extent they contribute to the PCs. Here, we see that
barium and sodium contribute primarily to PC 1 and not to PC 2, calcium and potassium
contribute primarily to PC 2 and not to PC 1, and the other variables contribute
in varying amounts to both components. The arrows are of varying lengths because
there are more than two PCs. For example, the arrow for iron is particularly short
because it contributes primarily to higher-order PCs (not shown).





Figure 12-9. Composition of the first two components in a principal components analysis 
of the forensic glass dataset. Component one (PC 1) measures primarily the amount of 
aluminum, barium, sodium, and magnesium in a glass fragment, whereas component 
two (PC 2) measures primarily the amount of calcium and potassium, and to some 
extent the amount of aluminum and magnesium. Data source: B. German. 

Next, we project the original data into the principal components space (Figure 12-10).
We see a defined clustering of distinct types of glass fragments in this plot. Fragments
from both headlamps and windows fall into clearly delineated regions in the PC plot,
with few outliers. Fragments from tableware and from containers are a little more
spread out, but nevertheless clearly distinct from both headlamp and window fragments.
By comparing Figure 12-10 with Figure 12-9, we can conclude that window
samples tend to have higher than average magnesium content and lower than average
barium, aluminum, and sodium content, whereas the opposite is true for headlamp
samples.





Figure 12-10. Composition of individual glass fragments visualized in the principal
components space defined in Figure 12-9. We see that the different types of glass samples
cluster at characteristic values of PCs 1 and 2. In particular, headlamps are characterized
by a negative PC 1 value whereas windows tend to have a positive PC 1 value.
Tableware and containers have PC 1 values close to zero and tend to have positive PC 2
values. However, there are a few exceptions where container fragments have both a negative
PC 1 value and a negative PC 2 value. These are fragments whose composition
drastically differs from all other fragments analyzed. Data source: B. German.


Paired Data 

A special case of multivariate quantitative data is paired data: data where there are 
two or more measurements of the same quantity under slightly different conditions. 
Examples include two comparable measurements on each subject (e.g., the length of 
the right and the left arm of a person), repeat measurements on the same subject at 
different time points (e.g., a person’s weight at two different times during the year), or 
measurements on two closely related subjects (e.g., the heights of two identical twins). 
For paired data, it is reasonable to assume that the two measurements belonging to 
each pair are more similar to each other than to the measurements belonging to other 
pairs. Two twins will be approximately of the same height but will differ in height 
from other twins. Therefore, for paired data, we need to choose visualizations that
highlight any differences between the paired measurements.

An excellent choice in this case is a simple scatterplot on top of a diagonal line marking
x = y. In such a plot, if the only difference between the two measurements of each
pair is random noise, then all points in the sample will be scattered symmetrically
around this line. Any systematic differences between the paired measurements, by
contrast, will be visible in a systematic shift of the data points up or down relative to
the diagonal. As an example, consider the carbon dioxide (CO2) emissions per person,
measured for 166 countries both in 1970 and in 2010 (Figure 12-11). This example
highlights two common features of paired data. First, most points are relatively
close to the diagonal line. Even though CO2 emissions vary over nearly four orders of
magnitude among countries, they are fairly consistent within each country over a
40-year time span. Second, the points are systematically shifted upwards relative to the
diagonal line. The majority of countries have seen an increase in CO2 emissions over
the 40 years considered.
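
The systematic shift relative to the diagonal can also be summarized numerically. The following Python sketch counts how many pairs lie above the line x = y and computes the median log ratio of the two measurements. The country labels and emission values are invented for illustration; they are not taken from the dataset shown in Figure 12-11.

```python
import math
from statistics import median

# Hypothetical emissions in tons per person as (1970, 2010) pairs (made up).
emissions = {
    "A": (0.05, 0.12),
    "B": (0.4, 1.1),
    "C": (1.5, 1.4),
    "D": (3.0, 5.5),
    "E": (8.0, 9.5),
    "F": (12.0, 10.0),
    "G": (0.9, 2.0),
    "H": (20.0, 22.0),
}

# A point lies above the diagonal when the second measurement is larger.
above = [name for name, (e1970, e2010) in emissions.items() if e2010 > e1970]
fraction_above = len(above) / len(emissions)

# Values spanning orders of magnitude are best compared as log ratios.
log_ratios = [math.log10(e2010 / e1970) for e1970, e2010 in emissions.values()]
median_log_ratio = median(log_ratios)

print(f"{fraction_above:.0%} of pairs lie above the diagonal")
print(f"median log10 ratio: {median_log_ratio:+.3f}")
```

If the only difference within each pair were random noise, we would expect roughly half of the points above the diagonal and a median log ratio near zero.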



Figure 12-11. Carbon dioxide emissions per person in 1970 and 2010, for 166 countries.
Each dot represents one country. The diagonal line represents identical CO2 emissions in
1970 and 2010. The points are systematically shifted upwards relative to the diagonal
line: in the majority of countries, emissions were higher in 2010 than in 1970. Data
source: Carbon Dioxide Information Analysis Center.

Scatterplots such as Figure 12-11 work well when we have a large number of data
points and/or are interested in a systematic deviation of the entire dataset from the
null expectation. By contrast, if we have only a small number of observations and are
primarily interested in the identity of each individual case, a slopegraph may be a better
choice. In a slopegraph, we draw individual measurements as dots arranged into
two columns and indicate pairings by connecting the paired dots with a line. The
slope of each line highlights the magnitude and direction of change. Figure 12-12
uses this approach to show the 10 countries with the largest difference in CO2 emissions
per person from 2000 to 2010.

Figure 12-12. Carbon dioxide emissions per person in 2000 and 2010, for the 10 countries
with the largest difference between these 2 years. Data source: Carbon Dioxide
Information Analysis Center.
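
Selecting the cases for a slopegraph amounts to ranking the pairs by how much they changed. The Python sketch below does this for made-up values. The function name `largest_changes` is hypothetical, and ranking by absolute difference is an assumption about what "largest difference" means here.

```python
def largest_changes(values_start, values_end, k):
    """Return the k names with the largest absolute change, with signed change.

    Only names present in both dictionaries are considered.
    """
    changes = {
        name: values_end[name] - values_start[name]
        for name in values_start
        if name in values_end
    }
    ranked = sorted(changes, key=lambda name: abs(changes[name]), reverse=True)
    return [(name, changes[name]) for name in ranked[:k]]

# Hypothetical emissions in tons per person (made up).
e2000 = {"A": 10.0, "B": 5.0, "C": 20.0, "D": 1.0, "E": 8.0}
e2010 = {"A": 14.5, "B": 5.2, "C": 13.0, "D": 1.1, "E": 10.0}

top = largest_changes(e2000, e2010, k=3)
for name, change in top:
    direction = "up" if change > 0 else "down"
    print(f"{name}: {e2000[name]} -> {e2010[name]} ({direction})")
```

Keeping the signed change alongside the name preserves the direction of each line in the slopegraph, while the ranking itself depends only on the magnitude.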


Slopegraphs have one important advantage over scatterplots: they can be used to
compare more than two measurements at a time. For example, we can modify
Figure 12-12 to show CO2 emissions at three time points, here the years 2000, 2005,
and 2010 (Figure 12-13). This choice highlights countries with a large change in
emissions over the entire decade as well as countries such as Qatar or Trinidad and
Tobago for which there is a large difference in the trend seen for the first five-year
interval and the second one.

Figure 12-13. CO2 emissions per person in 2000, 2005, and 2010, for the 10 countries
with the largest difference between the years 2000 and 2010. Data source: Carbon Dioxide
Information Analysis Center.

CHAPTER 13


Visualizing Time Series and Other Functions
of an Independent Variable


The preceding chapter discussed scatterplots, where we plot one quantitative variable 
against another. A special case arises when one of the two variables can be thought of 
as time, because time imposes additional structure on the data. Now the data points 
have an inherent order; we can arrange the points in order of increasing time and 
define a predecessor and successor for each data point. We frequently want to visualize
this temporal order, and we do so with line graphs. Line graphs are not limited to
time series, however. They are appropriate whenever one variable imposes an ordering
on the data. This scenario arises also, for example, in a controlled experiment
where a treatment variable is purposefully set to a range of different values. If we have 
multiple variables that depend on time, we can either draw separate line plots or we 
can draw a regular scatterplot and then draw lines to connect the neighboring points 
in time. 

Individual Time Series 

As a first demonstration of a time series, we will consider the pattern of monthly preprint
submissions in biology. Preprints are scientific articles that researchers post
online before formal peer review and publication in a scientific journal. The preprint
server bioRxiv, which was founded in November 2013 specifically for researchers
working in the biological sciences, has seen substantial growth in monthly submissions
since. We can visualize this growth by making a form of scatterplot (Chapter 12)
where we draw dots representing the number of submissions in each month
(Figure 13-1).

Figure 13-1. Monthly submissions to the preprint server bioRxiv, from its inception in
November 2013 until April 2018. Each dot represents the number of submissions in one
month. There has been a steady increase in submission volume throughout the entire
4.5-year period. Data source: Jordan Anaya, http://www.prepubmed.org.


There is an important difference, however, between Figure 13-1 and the scatterplots 
discussed in Chapter 12. In Figure 13-1, the dots are spaced evenly along the x axis, 
and there is a defined order among them. Each dot has exactly one left and one right 
neighbor (except the leftmost and rightmost points, which have only one neighbor 
each). We can visually emphasize this order by connecting neighboring points with 
lines (Figure 13-2). Such a plot is called a line graph. 

Figure 13-2. Monthly submissions to the preprint server bioRxiv, shown as dots connected
by lines. The lines do not represent data and are only meant as a guide to the eye. By
connecting the individual dots with lines, we emphasize that there is an order between
the dots: each dot has exactly one neighbor that comes before it and one that comes
after. Data source: Jordan Anaya, http://www.prepubmed.org.
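
The ordering that a line graph relies on can be made explicit in code. The Python sketch below sorts observations by date and pairs each point with its successor, which yields the line segments to draw. The monthly counts are invented for illustration and are not the bioRxiv data.

```python
from datetime import date

# Hypothetical monthly counts, deliberately listed out of order (made up).
observations = [
    (date(2014, 1, 1), 60),
    (date(2013, 11, 1), 25),
    (date(2014, 2, 1), 75),
    (date(2013, 12, 1), 40),
]

# Tuples sort by their first element, so this orders the points in time.
ordered = sorted(observations)

# Each segment connects a point to its successor in time.
segments = list(zip(ordered[:-1], ordered[1:]))

for (t0, y0), (t1, y1) in segments:
    print(f"{t0} ({y0}) -> {t1} ({y1})")
```

With n points there are n - 1 segments: every point has one predecessor and one successor, except the first and last points, which have only one neighbor each.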

Some people object to drawing lines between points because the lines do not represent
observed data. In particular, if there are only a few observations spaced far apart,
had observations been made at intermediate times they would probably not have
fallen exactly onto the lines shown. Thus, in a sense, the lines correspond to made-up
data. Yet they may help with perception when the points are spaced far apart or are
unevenly spaced. We can somewhat resolve this dilemma by pointing it out in the figure
caption, for example by writing “lines are meant as a guide to the eye” (see caption
of Figure 13-2).

Using lines to represent time series is generally accepted practice, however, and frequently
the dots are omitted altogether (Figure 13-3). Without dots, the figure places
more emphasis on the overall trend in the data and less on individual observations. A
figure without dots is also visually less busy. In general, the denser the time series, the
less important it is to show individual observations with dots. For the preprint dataset
shown here, I think omitting the dots is fine.

Figure 13-3. Monthly submissions to the preprint server bioRxiv, shown as a line graph
without dots. Omitting the dots emphasizes the overall temporal trend while deemphasizing
individual observations at specific time points. It is particularly useful when the
time points are spaced very densely. Data source: Jordan Anaya,
http://www.prepubmed.org.

We can also fill the area under the curve with a solid color (Figure 13-4). This choice
further emphasizes the overarching trend in the data, because it visually separates the
area above the curve from the area below. However, this visualization is only valid if
the y axis starts at zero, so that the height of the shaded area at each time point represents
the data value at that time point.

Figure 13-4. Monthly submissions to the preprint server bioRxiv, shown as a line graph
with filled area underneath. By filling the area under the curve, we put even more
emphasis on the overarching temporal trend than if we just draw a line (Figure 13-3).
Data source: Jordan Anaya, http://www.prepubmed.org.

Multiple Time Series and Dose-Response Curves 

We often have multiple time courses that we want to show at once. In this case, we
have to be more careful in how we plot the data, because the figure can become confusing
or difficult to read. For example, if we want to show the monthly submissions
to multiple preprint servers, a scatterplot is not a good idea, because the individual
time courses run into each other (Figure 13-5). Connecting the dots with lines alleviates
this issue (Figure 13-6).

Figure 13-5. Monthly submissions to three preprint servers covering biomedical research:
bioRxiv, the q-bio section of arXiv, and PeerJ Preprints. Each dot represents the number
of submissions in one month to the respective preprint server. This figure is labeled “bad”
because the three time courses visually interfere with each other and are difficult to read.
Data source: Jordan Anaya, http://www.prepubmed.org.

Figure 13-6. Monthly submissions to three preprint servers covering biomedical research.
By connecting the dots in Figure 13-5 with lines, we help the viewer follow each individual
time course. Data source: Jordan Anaya, http://www.prepubmed.org.

Figure 13-6 represents an acceptable visualization of the preprints dataset. However,
the separate legend creates unnecessary cognitive load. We can reduce this cognitive
load by labeling the lines directly (Figure 13-7). I have also eliminated the individual
dots in this figure, for a result that is much more streamlined and easy to read than
the original starting point, Figure 13-5.



Figure 13-7. Monthly submissions to three preprint servers covering biomedical research. 
Directly labeling the lines instead of providing a legend reduces the cognitive load 
required to read the figure, and eliminating the legend removes the need for points of 
different shapes. This enables us to streamline Figure 13-6 further by eliminating the 
dots. Data source: Jordan Anaya, http://www.prepubmed.org. 

Line graphs are not limited to time series. They are appropriate whenever the data
points have a natural order that is reflected in the variable shown along the x axis, so
that neighboring points can be connected with a line. This situation arises, for example,
in dose-response curves, where we measure how changing some numerical
parameter in an experiment (the dose) affects an outcome of interest (the response).
Figure 13-8 shows a classic experiment of this type, measuring oat yield in response
to increasing amounts of fertilization. The line graph visualization highlights how the
dose-response curves have a similar shape for the three oat varieties considered but
differ in the starting point in the absence of fertilization (i.e., some varieties have naturally
higher yield than others).

Figure 13-8. Dose-response curve showing the mean yield of oat varieties after fertilization
with manure. The manure serves as a source of nitrogen, and oat yields generally
increase as more nitrogen is available, regardless of variety. Here, manure application is
measured in cwt (hundredweight) per acre. The hundredweight is an old imperial unit
equal to 112 lbs or 50.8 kg. Data source: [Yates 1935].


Time Series of Two or More Response Variables 

In the preceding examples we dealt with time courses of only a single response variable
(e.g., preprint submissions per month or oat yield). It is not unusual, however, to
have more than one response variable. Such situations arise commonly in macroeconomics.
For example, we may be interested in the change in house prices from the
previous 12 months as it relates to the unemployment rate. We may expect that house
prices rise when the unemployment rate is low, and vice versa.
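
A 12-month change can be derived from a monthly series by comparing each value with the value one year earlier. The Python sketch below expresses the change in percent, which is an assumption about the exact definition. The function name `twelve_month_change` and the index values are made up for illustration and are not the house price data.

```python
def twelve_month_change(values):
    """Percent change of each monthly value relative to 12 months earlier.

    values must be in chronological order, one entry per month. The first
    12 entries of the result are None because no earlier value exists.
    """
    changes = []
    for i, value in enumerate(values):
        if i < 12:
            changes.append(None)
        else:
            previous = values[i - 12]
            changes.append(100.0 * (value - previous) / previous)
    return changes

# Hypothetical index that grows by 0.5% per month for two years (made up).
index = [100.0 * 1.005 ** month for month in range(24)]

changes = twelve_month_change(index)
print(changes[12])  # about 6.17 percent for steady 0.5% monthly growth
```

Because each value is compared with the same calendar month of the previous year, this measure is insensitive to seasonal patterns that repeat every year.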

With the tools from the preceding sections, we can visualize such data as two separate 
line graphs stacked on top of each other (Figure 13-9). This plot directly shows the 
two variables of interest, and it is straightforward to interpret. However, because the 
two variables are shown as separate line graphs, drawing comparisons between them 
can be cumbersome. If we want to identify temporal regions when both variables 
move in the same or in opposite directions, we need to switch back and forth between 
the two graphs and compare the relative slopes of the two curves. 

Figure 13-9. Twelve-month change in house prices (a) and unemployment rate (b) over
time, from January 2001 through December 2017. Data sources: Freddie Mac House
Price Index, US Bureau of Labor Statistics.


As an alternative to showing two separate line graphs, we can plot the two variables 
against each other, drawing a path that leads from the earliest time point to the latest 
(Figure 13-10). Such a visualization is called a connected scatterplot, because we are 
technically making a scatterplot of the two variables against each other and then are 
connecting neighboring points. Physicists and engineers often call this a phase portrait,
because in their disciplines it is commonly used to represent movement in phase
space. We have previously encountered connected scatterplots in Chapter 3, where I 
plotted the daily temperature normals in Houston, TX, versus those in San Diego, CA 
(Figure 3-3). 

Figure 13-10. Twelve-month change in house prices versus unemployment rate, from
January 2001 through December 2017, shown as a connected scatterplot. Darker shades 
represent more recent months. The anticorrelation seen in Figure 13-9 between the 
change in house prices and the unemployment rate causes the connected scatterplot to 
form two counterclockwise circles. Original figure concept: Len Kiefer. Data sources: 
Freddie Mac House Price Index, US Bureau of Labor Statistics. 

In a connected scatterplot, lines going in the direction from the lower left to the 
upper right represent correlated movement between the two variables (as one variable 
grows, so does the other), and lines going in the perpendicular direction, from the 
upper left to the lower right, represent anticorrelated movement (as one variable 
grows, the other shrinks). If the two variables have a somewhat cyclic relationship, we 
will see circles or spirals in the connected scatterplot. In Figure 13-10, we see one 
small circle from 2001 through 2005 and one large circle for the remainder of the 
time course. 
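
The reading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_segments` and the sample values are made up and are not the data behind Figure 13-10. Each step of the path is labeled by the sign of the product of the two changes.

```python
def classify_segments(x, y):
    """Label each step of a connected scatterplot path.

    A step toward the upper right or lower left (both variables change in
    the same direction) is "correlated"; a step toward the upper left or
    lower right is "anticorrelated".
    """
    labels = []
    for i in range(1, len(x)):
        dx = x[i] - x[i - 1]
        dy = y[i] - y[i - 1]
        product = dx * dy
        if product > 0:
            labels.append("correlated")
        elif product < 0:
            labels.append("anticorrelated")
        else:
            labels.append("flat")  # one of the variables did not change
    return labels


# Made-up values for illustration only.
unemployment = [4.0, 4.5, 5.5, 6.0, 5.0]
price_change = [8.0, 6.0, 3.0, 4.0, 7.0]
print(classify_segments(unemployment, price_change))
```

A path that alternates between long runs of both labels is a hint that the two variables have a cyclic relationship.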

When drawing a connected scatterplot, it is important that we indicate both the 
direction and the temporal scale of the data. Without such hints, the plot can turn 
into a meaningless scribble (Figure 13-11). In Figure 13-10 I used a gradual darkening
of the color to indicate direction; alternatively, one could draw arrows along the
path. 

Is it better to use a connected scatterplot or two separate line graphs? Separate line
graphs tend to be easier to read, but once people are used to connected scatterplots
they may be able to extract certain patterns (such as cyclical behavior with some
irregularity) that can be difficult to spot in line graphs. In fact, to me the cyclical relationship
between change in house prices and unemployment rate is hard to spot in
Figure 13-9, but the counterclockwise spiral in Figure 13-10 reveals it. Research 
reports that readers are more likely to confuse order and direction in a connected 
scatterplot than in line graphs, and less likely to report correlation [Haroz, Kosara, 
and Franconeri 2016]. On the flip side, connected scatterplots seem to result in 
higher engagement, and thus such plots may be effective tools to draw readers into a 
story. 



Figure 13-11. Twelve-month change in house prices versus unemployment rate, from 
January 2001 through December 2017. This figure is labeled “bad” because without the 
date markers and color shading of Figure 13-10, we can see neither the direction nor the 
speed of change in the data. Data sources: Freddie Mac House Price Index, US Bureau
of Labor Statistics.


Even though connected scatterplots can show only two variables at a time, we can 
also use them to visualize higher-dimensional datasets. The trick is to apply dimension
reduction first (see Chapter 12). We can then draw a connected scatterplot in the
dimension-reduced space. As an example of this approach, we will visualize a database
of monthly observations of over 100 macroeconomic indicators, provided by the
Federal Reserve Bank of St. Louis. We perform a principal components analysis
(PCA) of all indicators and then draw a connected scatterplot of PC 2 versus PC 1
(Figure 13-12a) and versus PC 3 (Figure 13-12b).

Figure 13-12. Visualizing a high-dimensional time series as a connected scatterplot in
principal components space. The path indicates the joint movement of over 100 macroeconomic
indicators from January 1990 to December 2017. Times of recession and recovery
are indicated via color, and the endpoints of the three recessions (March 1991,
November 2001, and June 2009) are also labeled. (a) PC 2 versus PC 1. (b) PC 2 versus
PC 3. Data source: M. W. McCracken, St. Louis Fed.

Notably, Figure 13-12a looks almost like a regular line plot, with time running from
left to right. This pattern is caused by a common feature of PCA: the first component 
often measures the overall size of the system. Here, PC 1 approximately measures the 
overall size of the economy, which rarely decreases over time. 

By coloring the connected scatterplot by times of recession and recovery, we can see 
that recessions are associated with a drop in PC 2 whereas recoveries do not correspond
to a specific feature in either PC 1 or PC 2 (Figure 13-12a). The recoveries do,
however, seem to correspond to a drop in PC 3 (Figure 13-12b). Moreover, in the
PC 2 versus PC 3 plot, we see that the line follows the shape of a clockwise spiral. This 
pattern emphasizes the cyclical nature of the economy, with recessions following 
recoveries and vice versa. 


CHAPTER 14


Visualizing Trends 


When making scatterplots (Chapter 12) or time series (Chapter 13), we are often 
more interested in the overarching trend of the data than in the specific detail of 
where each individual data point lies. By drawing the trend on top of or instead of the 
actual data points, usually in the form of a straight or curved line, we can create a 
visualization that helps the reader immediately see key features of the data. There are 
two fundamental approaches to determining a trend: we can either smooth the data 
by some method, such as a moving average, or we can fit a curve with a defined functional
form and then draw the fitted curve. Once we have identified a trend in a dataset,
it may also be useful to look specifically at deviations from the trend or to
separate the data into multiple components, including the underlying trend, any 
existing cyclical components, and episodic components or random noise. 

Smoothing 

Let us consider a time series of the Dow Jones Industrial Average (Dow Jones for 
short), a stock market index representing the price of 30 large, publicly owned US 
companies. Specifically, we will look at the year 2009, right after the 2008 crash 
(Figure 14-1). During the tail end of the crash, in the first 3 months of the year 2009, 
the market lost over 2,400 points (-27%). Then it slowly recovered for the remainder 
of the year. How can we visualize these longer-term trends while deemphasizing the 
less important short-term fluctuations? 

Figure 14-1. Daily closing values of the Dow Jones Industrial Average for the year 2009. 
Data source: Yahoo! Finance. 

In statistical terms, we are looking for a way to smooth the stock market time series. 
The act of smoothing produces a function that captures key patterns in the data while 
removing irrelevant minor detail or noise. Financial analysts usually smooth stock 
market data by calculating moving averages. To generate a moving average, we take a 
time window, say the first 20 days in the time series, calculate the average price over 
these 20 days, then move the time window by one day, so it now spans the 2nd to 21st 
days. We then calculate the average over these 20 days, move the time window again, 
and so on. The result is a new time series consisting of a sequence of averaged prices. 

To plot this sequence of moving averages, we need to decide which specific time point 
to associate with the average for each time window. Financial analysts often plot each 
average at the end of its respective time window. This choice results in curves that lag 
the original data (Figure 14-2a), with more severe lags corresponding to larger averaging
time windows. Statisticians, on the other hand, plot the average at the center of
the time window, which results in a curve that overlays perfectly on the original data 
(Figure 14-2b). 

Figure 14-2. Daily closing values of the Dow Jones Industrial Average for the year 2009,
shown together with their 20-day, 50-day, and 100-day moving averages. (a) The moving
averages are plotted at the ends of the moving time windows. (b) The moving averages
are plotted in the centers of the moving time windows. Data source: Yahoo! Finance.
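
The moving-average procedure and the two plotting conventions can be sketched as follows. This is a minimal illustration in plain Python: the function name `moving_average`, its `align` argument, and the sample prices are assumptions made for this example, not part of any particular plotting library.

```python
def moving_average(values, window, align="right"):
    """Return (index, average) pairs for a simple moving average.

    align="right" attaches each average to the last point of its window
    (the convention described for financial analysts); align="center"
    attaches it to the middle point (the convention described for
    statisticians).
    """
    if align not in ("right", "center"):
        raise ValueError("align must be 'right' or 'center'")
    result = []
    for start in range(len(values) - window + 1):
        avg = sum(values[start:start + window]) / window
        if align == "right":
            index = start + window - 1
        else:
            index = start + (window - 1) // 2
        result.append((index, avg))
    return result


prices = [10, 12, 11, 13, 15, 14, 16]  # made-up values
print(moving_average(prices, 3, align="right"))
print(moving_average(prices, 3, align="center"))
```

Both calls return the same averages; only the position they are attached to differs. Either way the result has `window - 1` fewer points than the input.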

Regardless of whether we plot the smoothed time series with or without lag, we can 
see that the length of the time window over which we average sets the scale of the 
fluctuations that remain visible in the smoothed curve. The 20-day moving average 
removes small, short-term spikes but otherwise follows the daily data closely. The 
100-day moving average, on the other hand, removes even fairly substantial drops or 
spikes that play out over a time span of multiple weeks. For example, the massive 
drop to below 7,000 points in the first quarter of 2009 is not visible in the 100-day 
moving average, which replaces it with a gentle curve that doesn’t dip much below 
8,000 points (Figure 14-2). Similarly, the drop around July 2009 is completely invisible
in the 100-day moving average.

The moving average is the most simplistic approach to smoothing, and it has some 
obvious limitations. First, it results in a smoothed curve that is shorter than the original
curve (Figure 14-2). Parts are missing at either the beginning or the end or both.
And the more the time series is smoothed (i.e., the larger the averaging window), the 
shorter the smoothed curve. Second, even with a large averaging window, a moving 
average is not necessarily that smooth. It may exhibit small bumps and wiggles even 
though larger-scale smoothing has been achieved (Figure 14-2). These wiggles are 
caused by individual data points that enter or exit the averaging window. Since all 
data points in the window are weighted equally, individual data points at the window 
boundaries can have a visible impact on the average. 

Statisticians have developed numerous approaches to smoothing that alleviate the 
downsides of moving averages. These approaches are much more complex and computationally
costly, but they are readily available in modern statistical computing
environments. One widely used method is locally estimated scatterplot smoothing
(LOESS) [Cleveland 1979], which fits low-degree polynomials to subsets of the data.
Importantly, the points in the center of each subset are weighted more heavily than
points at the boundaries, and this weighting scheme yields a much smoother result
than we get from an equally weighted moving average. The LOESS curve shown in Figure 14-3 looks
similar to the 100-day average in Figure 14-2, but this similarity should not be overinterpreted.
The smoothness of a LOESS curve can be tuned by adjusting a parameter,
and different parameter choices would have produced LOESS curves looking more 
like the 20-day or 50-day average. 
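
To make the idea of local, center-weighted fitting concrete, here is a simplified sketch in plain Python. It fits a weighted straight line (degree 1) around each point using tricube weights. The function name `loess` and the tuning parameter `frac` (the fraction of points used in each local fit) are naming choices for this example, and production implementations typically add refinements such as robustness iterations that are omitted here.

```python
def loess(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbors per local fit
    smoothed = []
    for x0 in x:
        # Keep the k points closest to x0.
        nearest = sorted((abs(xi - x0), xi, yi) for xi, yi in zip(x, y))[:k]
        h = nearest[-1][0] or 1.0  # bandwidth: distance to the k-th neighbor
        # Tricube weights: largest at the center, zero at the window edge.
        w = [(1 - (d / h) ** 3) ** 3 for d, _, _ in nearest]
        xs = [xi for _, xi, _ in nearest]
        ys = [yi for _, _, yi in nearest]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
        my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted regression line at x0.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Made-up noisy values for illustration.
xs = list(range(10))
ys = [1.0, 2.8, 5.3, 6.9, 9.4, 10.8, 13.1, 15.2, 16.7, 19.1]
print([round(v, 2) for v in loess(xs, ys, frac=0.5)])
```

Raising `frac` widens each local window and gives a smoother curve; lowering it lets the curve follow the data more closely, which is the tuning behavior described above.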

Importantly, LOESS is not limited to time series. It can be applied to arbitrary scatterplots,
as is apparent from its name, locally estimated scatterplot smoothing. For example,
we can use LOESS to look for trends in the relationship between a car’s fuel-tank
capacity and its price (Figure 14-4). The LOESS line shows that tank capacity grows
approximately linearly with price for cheap cars (below $20,000) but levels off for
more expensive cars. Above a price of approximately $20,000, buying a more expensive
car will not get you one with a larger fuel tank.

Figure 14-3. Comparison of LOESS fit to 100-day moving average for the Dow Jones
data of Figure 14-2. The overall trend shown by the LOESS smooth is nearly identical to 
the 100-day moving average, but the LOESS curve is much smoother and it extends to 
the entire range of the data. Data source: Yahoo! Finance. 

Figure 14-4. Fuel-tank capacity versus price of 93 cars released for the 1993 model year.
Each dot corresponds to one car. The solid line represents a LOESS smooth of the data.
We see that fuel-tank capacity increases approximately linearly with price, up to a price
of approximately $20,000, and then it levels off. Data source: Robin H. Lock, St. Lawrence
University.

LOESS is a very popular smoothing approach because it tends to produce results that
look right to the human eye. However, it requires the fitting of many separate regression
models. This makes it slow for large datasets, even on modern computing equipment.

As a faster alternative to LOESS, we can use spline models. A spline is a piecewise 
polynomial function that is highly flexible yet always looks smooth. When working 
with splines, we will encounter the term knot. The knots in a spline are the endpoints 
of the individual spline segments. If we fit a spline with k segments, we need to specify
k + 1 knots. While spline fitting is computationally efficient, in particular if the
number of knots is not too large, splines have their own downsides. Most importantly,
there is a bewildering array of different types of splines, including cubic
splines, B-splines, thin-plate splines, Gaussian process splines, and many others, and 
which one to pick may not be obvious. The specific choice of the type of spline and 
number of knots used can result in widely different smoothing functions for the same 
data (Figure 14-5). 

Most data visualization software will provide smoothing features, likely implemented 
as either a type of local regression (such as LOESS) or a type of spline. The smoothing 
method may be referred to as a generalized additive model (GAM), which is a superset 
of all these types of smoothers. It is important to be aware that the output of the 
smoothing feature is dependent on the specific GAM model that is fit. Unless you try 
out a number of different choices you may never realize to what extent the results you 
see depend on the specific default choices made by your statistical software. 



Be careful when interpreting the results from a smoothing function.
The same dataset can be smoothed in many different ways.

Figure 14-5. Different smoothing models display widely different behaviors, in particular
near the boundaries of the data. (a) LOESS smoother, as in Figure 14-4. (b) Cubic
regression splines with 5 knots. (c) Thin-plate regression spline with 3 knots. (d) Gaussian
process spline with 6 knots. Data source: Robin H. Lock, St. Lawrence University.


Showing Trends with a Defined Functional Form 

As we can see in Figure 14-5, the behavior of general-purpose smoothers can be 
somewhat unpredictable for any given dataset. These smoothers also do not provide 
parameter estimates that have a meaningful interpretation. Therefore, whenever possible,
it is preferable to fit a curve with a specific functional form that is appropriate
for the data and that uses parameters with clear meaning. 

For the fuel-tank data, we need a curve that initially rises linearly but then levels off at 
a constant value. The function y = A - B exp(-mx) may fit that bill. Here, A, B, and m 
are the constants we adjust to fit the curve to the data. The function is approximately 
linear for small x, with y ~ A - B + Bmx; it approaches a constant value for large 
x, y ~ A, and it is strictly increasing for all values of x. Figure 14-6 shows that this 
equation fits the data at least as well as any of the smoothers we considered previously
(Figure 14-5). 

Figure 14-6. Fuel-tank data represented with an explicit analytical model. The solid line
corresponds to a least-squares fit of the formula y = A - B exp(-mx) to the data. Fitted
parameters are A = 19.6, B = 29.2, m = 0.00015. Data source: Robin H. Lock, St. Lawrence
University.
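
The limiting behavior of y = A - B exp(-mx) can be checked with a short derivation. The sketch below assumes B > 0 and m > 0 (as is the case for the fitted parameters reported in Figure 14-6) and uses the first-order expansion exp(-mx) ≈ 1 - mx, which holds only while mx is much smaller than 1.

```latex
\begin{align*}
y &= A - B\,e^{-mx} \\
  &\approx A - B\,(1 - mx) = A - B + Bmx && (mx \ll 1) \\
y &\to A && (x \to \infty) \\
\frac{dy}{dx} &= Bm\,e^{-mx} > 0 && (B > 0,\ m > 0)
\end{align*}
```

The last line shows why the curve is strictly increasing: its slope is positive everywhere, starting at Bm and decaying toward zero as the curve levels off at A.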

A functional form that is applicable in many different contexts is the simple straight 
line, y = A + mx. Approximately linear relationships between two variables are surprisingly
common in real-world datasets. For example, in Chapter 12, I discussed the
relationship between head length and body mass in blue jays. This relationship is 
approximately linear, for both female and male birds, and drawing linear trend lines 
on top of the points in a scatterplot helps the reader perceive the trends (Figure 14-7). 

When the data displays a nonlinear relationship, we need to guess what an appropriate
functional form might be. In this case, we can assess the accuracy of our guess by
transforming the axes in such a way that a linear relationship emerges. To demonstrate
this principle, let’s return to the monthly submissions to the preprint server
bioRxiv, discussed in Chapter 13. If the increase in submissions in each month is proportional
to the number of submissions in the previous month (i.e., if submissions
grow by a fixed percentage each month), then the resulting curve is exponential. This
assumption seems to be met for the bioRxiv data, because a curve with exponential 
form, y = A exp (mx), fits the bioRxiv submission data well (Figure 14-8). 

Figure 14-7. Head length versus body mass for 123 blue jays. The birds’ sex is indicated
by color. This figure is equivalent to Figure 12-2, except that now we have drawn linear 
trend lines on top of the individual data points. Data source: Keith Tarvin, Oberlin 
College. 



Figure 14-8. Monthly submissions to the preprint server bioRxiv. The solid blue line represents
the actual monthly preprint counts and the dashed black line represents an exponential
fit to the data, y = 60 exp[0.77(x - 2014)]. Data source: Jordan Anaya,
http://www.prepubmed.org/.

If the original curve is exponential, y = A exp (mx), then a log-transformation of the y
values will turn it into a linear relationship, log(y) = log(A) + mx. Therefore, plotting 
the data with log-transformed y values (or equivalently, with a logarithmic y axis) and 
looking for a linear relationship is a good way of determining whether a dataset 
exhibits exponential growth. For the bioRxiv submission numbers, we indeed obtain 
a linear relationship when using a logarithmic y axis (Figure 14-9). 



Figure 14-9. Monthly submissions to the preprint server bioRxiv, shown on a log scale. 
The solid blue line represents the actual monthly preprint counts, the dashed black line 
represents the exponential fit from Figure 14-8, and the solid black line represents a linear
fit to log-transformed data, corresponding to y = 43 exp[0.88(x - 2014)]. Data
source: Jordan Anaya, http://www.prepubmed.org/.

In Figure 14-9, in addition to the actual submission counts, I am also showing the 
exponential fit from Figure 14-8 and a linear fit to the log-transformed data. These 
two fits are similar but not identical. In particular, the slope of the dashed line seems 
somewhat off. The line systematically falls above the individual data points for half 
the time series. This is a common problem with exponential fits: the squared deviations
from the data points to the fitted curve are so much larger for the largest data
values than for the smallest data values that the deviations of the smallest data values
contribute little to the overall sum of squares that the fit minimizes. As a result, the
fitted line systematically overshoots or undershoots the smallest data values. For this
reason, I generally advise avoiding exponential fits and instead using linear fits on
log-transformed data.

It is usually better to fit a straight line to transformed data than to
fit a nonlinear curve to untransformed data. 
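
As a sketch of this advice, the following plain-Python function fits y = A exp(mx) by ordinary least squares on the pairs (x, log y). The function name and the synthetic data are assumptions made for this example; the bioRxiv counts themselves are not reproduced here.

```python
import math


def fit_exponential_via_log(x, y):
    """Fit y = A * exp(m * x) by a straight-line fit to (x, log y).

    All y values must be positive, since their logarithm is taken.
    """
    logs = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log = sum(logs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_log) for xi, li in zip(x, logs))
    m = sxy / sxx                          # slope of the line in log space
    A = math.exp(mean_log - m * mean_x)    # intercept, transformed back
    return A, m


# Synthetic, noise-free data generated from A = 50 and m = 0.8.
xs = [0, 1, 2, 3, 4, 5]
ys = [50 * math.exp(0.8 * t) for t in xs]
A, m = fit_exponential_via_log(xs, ys)
print(A, m)
```

Because the fit is carried out on log(y), a given relative deviation counts equally whether it occurs at small or at large values, so the small values are no longer drowned out by the large ones.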


A plot such as Figure 14-9 is commonly referred to as log-linear, since the y axis is 
logarithmic and the x axis is linear. Other plots we may encounter include log-log, 
where both the y and the x axis are logarithmic, and linear-log, where y is linear and x 
is logarithmic. In a log-log plot, power laws of the form y ~ x^α appear as straight lines
(see Figure 8-7 for an example), and in a linear-log plot, logarithmic relationships of 
the form y ~ log(x) appear as straight lines. Other functional forms can be turned 
into linear relationships with more specialized coordinate transformations, but these 
three (log-linear, log-log, linear-log) cover a wide range of real-world applications. 
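
Why each of these functional forms becomes a straight line follows from taking logarithms. In the sketch below, the constants c, a, and b and the exponent α are generic symbols introduced for illustration.

```latex
\begin{align*}
y = A\,e^{mx} &\;\Longrightarrow\; \log y = \log A + m\,x
  && \text{(log-linear: straight line with slope } m\text{)} \\
y = c\,x^{\alpha} &\;\Longrightarrow\; \log y = \log c + \alpha \log x
  && \text{(log-log: straight line with slope } \alpha\text{)} \\
y = a + b \log x &\;\Longrightarrow\; y \text{ is linear in } \log x
  && \text{(linear-log: straight line with slope } b\text{)}
\end{align*}
```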

Detrending and Time-Series Decomposition 

For any time series with a prominent long-term trend, it may be useful to remove this 
trend to specifically highlight any notable deviations. This technique is called detrending,
and I will demonstrate it here with house prices. In the US, the mortgage lender
Freddie Mac publishes a monthly index called the Freddie Mac House Price Index that 
tracks the change in housing prices over time. The index attempts to capture the state 
of the entire house market in a given region, such that an increase in the index by, for 
example, 10% can be interpreted as an average house price increase of 10% in the 
respective market. The index is arbitrarily set to a value of 100 in December 2000. 

Over long periods of time, house prices tend to display consistent annual growth, 
approximately in line with inflation. However, overlaid on top of this trend are housing
bubbles that lead to severe boom and bust cycles. Figure 14-10 shows the actual
house price index and its long-term trend for four select US states. We see that
between 1980 and 2017, California underwent two bubbles, one in 1990 and one in
the mid-2000s. During the same period, Nevada experienced only one bubble, in the
mid-2000s, and house prices in Texas and West Virginia closely followed their long-term
trends the entire time. Because house prices tend to grow in percent increments,
i.e., exponentially, I have chosen a logarithmic y axis in Figure 14-10. The straight 
lines correspond to a 4.7% annual price increase in California and a 2.8% annual 
price increase each in Nevada, Texas, and West Virginia. 

Figure 14-10. Freddie Mac House Price Index from 1980 through 2017, for four selected
states (California, Nevada, Texas, and West Virginia). The House Price Index is a unitless
number that tracks relative house prices in the chosen geographic region over time.
The index is scaled arbitrarily such that it equals 100 in December of the year 2000. The
blue lines show the monthly fluctuations in the index and the straight gray lines show the
long-term price trends in the respective states. Note that the y axes are logarithmic, so
that the straight gray lines represent consistent exponential growth. Data source: Freddie
Mac House Price Index.


We detrend housing prices by dividing the actual price index at each time point by the 
respective value in the long-term trend. Visually, this division will look like we are 
subtracting the gray lines from the blue lines in Figure 14-10, because a division of 
the untransformed values is equivalent to a subtraction of the log-transformed values. 
The resulting detrended house prices show the housing bubbles more clearly 
(Figure 14-11), as the detrending emphasizes the unexpected movements in a time 
series. For example, in the original time series, the decline in home prices in California
from 1990 to about 1998 looks modest (Figure 14-10). However, during that same
time period, on the basis of the long-term trend we would have expected prices to 
rise. Relative to the expected rise the drop in prices was substantial, amounting to 
25% at the lowest point (Figure 14-11). 

Figure 14-11. Detrended version of the Freddie Mac House Price Index shown in
Figure 14-10. The detrended index was calculated by dividing the actual index (blue
lines in Figure 14-10) by the expected value based on the long-term trend (straight gray
lines in Figure 14-10). This visualization shows that California experienced two housing
bubbles, around 1990 and in the mid-2000s, identifiable from a rapid rise and subsequent
decline in the actual housing prices relative to what would have been expected
from the long-term trend. Similarly, Nevada experienced one housing bubble, in the
mid-2000s, and neither Texas nor West Virginia experienced much of a bubble at all.
Data source: Freddie Mac House Price Index.
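
The detrending step can be sketched in plain Python. The index values below are made up, and anchoring the trend line at 100 in the year 2000 is an assumption made only for this illustration; the 4.7% annual growth rate is the California figure quoted in the text. The final loop checks that dividing on the original scale is the same as subtracting on the log scale.

```python
import math


def detrend(index_values, years, base_value, base_year, annual_growth):
    """Divide each index value by an exponential trend with fixed annual growth."""
    detrended = []
    for value, year in zip(index_values, years):
        trend = base_value * (1 + annual_growth) ** (year - base_year)
        detrended.append(value / trend)
    return detrended


# Made-up index values for illustration only.
years = [1998, 1999, 2000, 2001, 2002]
index = [85.0, 92.0, 100.0, 110.0, 125.0]
ratios = detrend(index, years, base_value=100.0, base_year=2000,
                 annual_growth=0.047)
print([round(r, 3) for r in ratios])

# Division of the raw values equals subtraction of the log-transformed values.
for value, year, ratio in zip(index, years, ratios):
    trend = 100.0 * (1 + 0.047) ** (year - 2000)
    assert math.isclose(math.log(ratio), math.log(value) - math.log(trend),
                        abs_tol=1e-12)
```

A detrended value of 1 means the index sits exactly on its long-term trend; values above or below 1 show by what factor prices exceed or fall short of it.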


Beyond simple detrending, we can also separate a time series into multiple distinct 
components, such that their sum recovers the original time series. In general, in addition
to a long-term trend, there are three distinct components that may shape a time
series. First, there is random noise, which causes small, erratic movements up and 
down. This noise is visible in all the time series shown in this chapter, but maybe the 
most in Figure 14-9. Second, there can be unique external events that leave their 
mark in the time series, such as the distinct housing bubbles seen in Figure 14-10. 
Third, there can be cyclical variations. For example, outside temperatures show daily 
cyclical variations. The highest temperatures are reached in the early afternoon and 
the lowest temperatures in the early morning. Outside temperatures also show yearly 
cyclical variations. They tend to rise in the spring, reach their maximum in the summer,
and then decline in the fall and reach their minimum in the winter (Figure 3-2).

To demonstrate the concept of distinct time-series components, I will here decompose
the Keeling curve, which shows changes in CO2 abundance over time
(Figure 14-12). Since 1958, CO2 abundance has been continuously monitored at the
Mauna Loa Observatory in Hawaii, initially under the direction of Charles Keeling.

CO2 is measured in parts per million (ppm). We see a long-term increase in CO2
abundance that is slightly faster than linear, from below 325 ppm in the 1960s to
above 400 ppm in the second decade of the 21st century (Figure 14-12). CO2 abundance
also fluctuates annually, following a consistent up-and-down pattern overlaid on top
of the overall increase. The annual fluctuations are driven by plant growth in the
northern hemisphere. Plants consume CO2 during photosynthesis. Because most of
the globe’s land masses are located in the northern hemisphere, and plant growth is
most active in the spring and summer, we see an annual global decline in atmospheric
CO2 that coincides with the summer months in the northern hemisphere.


Figure 14-12. The Keeling curve. The Keeling curve shows the change of CO2 abundance
in the atmosphere over time. Shown here are monthly average CO2 readings, expressed
in parts per million (ppm). The CO2 readings fluctuate annually with the seasons but
show a consistent long-term trend of increase. Data source: Dr. Pieter Tans, NOAA/ESRL,
and Dr. Ralph Keeling, Scripps Institution of Oceanography.


We can decompose the Keeling curve into its long-term trend, seasonal fluctuations, 
and remainder (Figure 14-13). The specific method I am using here is called seasonal 
decomposition of time series by LOESS (STL) [Cleveland et al. 1990], but there are 
many other methods that achieve similar goals. 

Figure 14-13. Time-series decomposition of the Keeling curve, showing the monthly average
(as in Figure 14-12), the long-term trend, seasonal fluctuations, and the remainder.
The remainder is the difference between the actual readings and the sum of the long-term
trend and the seasonal fluctuations, and it represents random noise. I have zoomed
into the most recent 30 years of data to emphasize the shape of the annual fluctuations.
Data source: Dr. Pieter Tans, NOAA/ESRL, and Dr. Ralph Keeling, Scripps Institution of
Oceanography.


The decomposition shows that over the last three decades, CO2 abundance has
increased by over 50 ppm. By comparison, seasonal fluctuations amount to less than
8 ppm (they never cause an increase or a decrease of more than 4 ppm relative to the
long-term trend), and the remainder amounts to less than 1.6 ppm (Figure 14-13).
The remainder is the difference between the actual readings and the sum of the long-term
trend and the seasonal fluctuations, and here it corresponds to random noise in
the monthly CO2 readings. More generally, however, the remainder could also
capture unique external events. For example, if a massive volcanic eruption released
substantial amounts of CO2, such an event might be visible as a sudden spike in the
remainder. Figure 14-13 shows that no such unique external events have had a major
effect on the Keeling curve in recent decades.
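
The idea of splitting a series into trend, seasonal component, and remainder can be illustrated with a much simpler method than STL. The sketch below is a classical additive decomposition based on a centered moving average, not the LOESS-based STL procedure used for Figure 14-13. The function name `decompose` and the synthetic series are assumptions made for this example.

```python
import math


def decompose(series, period):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    Assumes an even `period` (e.g., 12 for monthly data). Trend and
    remainder are None near both ends, where the centered window does
    not fit.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        # Centered moving average over one full period; the two end
        # points of the window each get half weight.
        window = series[i - half:i + half + 1]
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[i] = total / period
    # Average the detrended values for each position in the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for i in range(n):
        if trend[i] is not None:
            sums[i % period] += series[i] - trend[i]
            counts[i % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    offset = sum(means) / period  # center the seasonal component on zero
    seasonal = [means[i % period] - offset for i in range(n)]
    remainder = [None if trend[i] is None
                 else series[i] - trend[i] - seasonal[i]
                 for i in range(n)]
    return trend, seasonal, remainder


# Synthetic monthly series: linear rise plus an annual cycle, no noise.
series = [300 + 0.1 * i + 3 * math.sin(2 * math.pi * i / 12)
          for i in range(48)]
trend, seasonal, remainder = decompose(series, 12)
print(round(trend[10], 3), round(seasonal[3], 3), round(remainder[10], 6))
```

For this noise-free input the remainder is essentially zero. With real measurements, the remainder would hold the random noise and any one-off events.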


CHAPTER 15


Visualizing Geospatial Data 


Many datasets contain information linked to locations in the physical world. For 
example, in an ecological study, a dataset may list where specific plants or animals 
have been found. Similarly, in a socioeconomic or political context, a dataset may 
contain information about where people with specific attributes (such as income, age, 
or educational attainment) live, or where man-made objects (e.g., bridges, roads, 
buildings) have been constructed. In all these cases, it can be helpful to visualize the 
data in their proper geospatial context, i.e., to show the data on a realistic map or 
alternatively as a map-like diagram. 

Maps tend to be intuitive to readers, but they can be challenging to design. We need 
to think about concepts such as map projections and whether for our specific application
the accurate representation of angles or areas is more critical. A common mapping
technique, the choropleth map, consists of representing data values as differently
colored spatial areas. Choropleth maps can at times be very useful and at other times
quite misleading. As an alternative, we can construct map-like diagrams called cartograms,
which may purposefully distort map areas or represent them in stylized form,
for example as equal-sized squares. 

Projections 

The earth is approximately a sphere (Figure 15-1), and more precisely an oblate spheroid
that is slightly flattened along its axis of rotation. The two locations where the
axis of rotation intersects with the spheroid are called the poles (north and south). We
separate the spheroid into two hemispheres, the northern and the southern hemisphere,
by drawing a line equidistant to both poles around the spheroid. This line is
called the equator. To uniquely specify a location on the earth, we need three pieces of 
information: where we are located along the direction of the equator (the longitude), 
how close we are to either pole when moving perpendicular to the equator (the 
latitude), and how far we are from the earth’s center (the altitude). Longitude, latitude,
and altitude are specified relative to a reference system called the datum. The datum 
specifies properties such as the shape and size of the earth, as well as the location of 
zero longitude, latitude, and altitude. One widely used datum is the World Geodetic 
System (WGS) 84, which is used by the Global Positioning System (GPS). 



Figure 15-1. Orthographic projection of the world, showing Europe and Northern Africa 
as they would be visible from space. The lines emanating from the north pole and running
south are called meridians, and the lines running orthogonal to the meridians are
called parallels. All meridians have the same length, but parallels become shorter the 
closer we are to either pole. 

While altitude is an important quantity in many geospatial applications, when visualizing
geospatial data in the form of maps we are primarily concerned with the other
two dimensions, longitude and latitude. Both longitude and latitude are angles, 
expressed in degrees. Degrees longitude measure how far east or west a location lies. 
Lines of equal longitude are referred to as meridians, and all meridians terminate at 
the two poles (Figure 15-1). The prime meridian, corresponding to 0° longitude, runs 
through the village of Greenwich in the United Kingdom. The meridian opposite to 
the prime meridian lies at 180° longitude (also referred to as 180°E), which is
equivalent to -180° longitude (also referred to as 180°W), near the international date
line. Degrees latitude measure how far north or south a location lies. The equator corresponds
to 0° latitude, the north pole corresponds to 90° latitude (also referred to as
90°N), and the south pole corresponds to -90° latitude (also referred to as 90°S).
Lines of equal latitude are referred to as parallels, since they run parallel to the equator.
All meridians have the same length, corresponding to half of a great circle around
the globe, whereas the length of parallels depends on their latitude (Figure 15-1). The 
longest parallel is the equator, at 0° latitude, and the shortest parallels lie at the north 
and south poles, 90°N and 90°S, and have length zero. 
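
The dependence of parallel length on latitude can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes a perfectly spherical earth with a mean radius of 6,371 km (an assumption; the text describes the earth as an oblate spheroid). The function names are made up for illustration.

```python
import math

# Assumed mean earth radius in kilometers; the earth is treated as a perfect sphere here.
EARTH_RADIUS_KM = 6371.0


def parallel_length_km(latitude_deg: float) -> float:
    """Circumference of the parallel at the given latitude on a sphere."""
    return 2 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))


def meridian_length_km() -> float:
    """Every meridian is half of a great circle, regardless of its longitude."""
    return math.pi * EARTH_RADIUS_KM


# Parallels shrink from the full equator down to a single point at the poles.
for lat in (0, 30, 60, 90):
    print(f"latitude {lat:2d} deg: parallel length {parallel_length_km(lat):9.1f} km")
print(f"any meridian: {meridian_length_km():9.1f} km")
```

Under this spherical approximation, the parallel at 60° latitude is half as long as the equator, and the parallels at the poles have length zero, as stated above.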

The challenge in map making is that we need to take the spherical surface of the earth 
and flatten it out so we can display it on a map. This process, called projection, necessarily
introduces distortions, because a curved surface cannot be projected exactly
onto a flat surface. Specifically, the projection can preserve either angles or areas but 
not both. A projection that does the former is called conformal and a projection that 
does the latter is called equal-area. Other projections may preserve neither angles nor 
areas but instead preserve other quantities of interest, such as distances to some reference
point or line. Finally, some projections attempt to strike a compromise between
preserving angles and areas. These compromise projections are frequently used to 
display the entire world in an aesthetically pleasing manner, and they accept some 
amount of both angular and area distortion (Figure 3-11). To systematize and keep 
track of different ways of projecting parts or all of the earth for specific maps, various 
standards bodies and organizations, such as the European Petroleum Survey Group 
(EPSG) and the Environmental Systems Research Institute (ESRI), maintain registries 
of projections. For example, EPSG:4326 represents unprojected longitude and latitude
values in the WGS 84 coordinate system used by GPS. Several websites provide
convenient access to these registered projections, including http://spatialreference.org/ 
and https://epsg.io/. 

One of the earliest map projections in use, the Mercator projection, was developed in 
the 16th century for nautical navigation. It is a conformal projection that accurately 
represents shapes but introduces severe area distortions near the poles (Figure 15-2). 
The Mercator projection maps the globe onto a cylinder and then unrolls the cylinder 
to arrive at a rectangular map. Meridians in this projection are evenly spaced vertical 
lines, whereas parallels are horizontal lines whose spacing increases the further we 
move away from the equator. The spacing between parallels increases in proportion 
to the extent to which they have to be stretched closer to the poles to keep the meridians
perfectly vertical.

Figure 15-2. Mercator projection of the world. In this projection, parallels are straight
horizontal lines and meridians are straight vertical lines. It is a conformal projection 
preserving local angles, but it introduces severe distortions in areas near the poles. For 
example, Greenland appears to be bigger than Africa in this projection, when in reality 
Africa is 14 times bigger than Greenland (see Figures 15-1 and 15-3). 
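
The widening gaps between parallels and the resulting area inflation can be reproduced with the standard formulas for a spherical Mercator projection. The sketch below assumes a unit sphere and ignores the ellipsoidal corrections that registered projections apply. The function names are hypothetical.

```python
import math


def mercator_xy(lon_deg: float, lat_deg: float) -> tuple[float, float]:
    """Project longitude/latitude (in degrees) onto a unit-sphere Mercator map."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = lam                                         # meridians: evenly spaced vertical lines
    y = math.log(math.tan(math.pi / 4 + phi / 2))   # parallels: spacing grows toward the poles
    return x, y


def area_inflation(lat_deg: float) -> float:
    """Factor by which areas are enlarged at a given latitude (1 / cos^2 of latitude)."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


# Vertical distance covered by each 20-degree step in latitude, and local area inflation.
for lat in (0, 20, 40, 60):
    step = mercator_xy(0, lat + 20)[1] - mercator_xy(0, lat)[1]
    print(f"{lat:2d} to {lat + 20:2d} deg: step {step:.3f}, inflation at {lat} deg: {area_inflation(lat):.2f}")
```

In this approximation, areas at 60° latitude are drawn four times larger than areas of the same true size at the equator, and the inflation keeps growing toward the poles.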

Because of the severe area distortions it produces, the Mercator projection has fallen 
out of favor for maps of the entire world. However, variants of this projection continue
to live on. For example, the transverse Mercator projection is routinely used for
large-scale maps that show moderately small areas (spanning less than a few degrees 
in longitude) at large magnification. Another variant, the web Mercator projection, 
was introduced by Google for Google Maps and is used by several online mapping 
applications.

A whole-world projection that is perfectly area-preserving is the Goode homolosine
(Figure 15-3). It is usually shown in its interrupted form, which has one cut in the 
northern hemisphere and three cuts in the southern hemisphere, carefully chosen so 
they don’t interrupt major land masses (Figure 15-3). The cuts allow the projection to 
both preserve areas and approximately preserve angles, at the cost of noncontiguous 
oceans, a cut through the middle of Greenland, and several cuts through Antarctica. 
While the interrupted Goode homolosine has an unusual aesthetic and a strange 
name, it is a good choice for mapping applications that require accurate reproduction 
of areas on a global scale. 



Figure 15-3. Interrupted Goode homolosine projection of the world. This projection 
accurately preserves areas while minimizing angular distortions, at the cost of showing 
oceans and some land masses (Greenland, Antarctica) in a noncontiguous way.

Shape or area distortions due to map projections are particularly prominent when
we’re attempting to make a map of the whole world, but they can cause trouble even 
at the scale of individual continents or countries. As an example, consider the United 
States, which consists of the lower 48 (which are 48 contiguous states), Alaska, and 
Hawaii (Figure 15-4). While the lower 48 alone are reasonably easy to project onto a 
map, Alaska and Hawaii are so distant from the lower 48 that projecting all 50 states 
onto one map becomes awkward. 



Figure 15-4. Relative locations of Alaska, Hawaii, and the lower 48 states shown on a 
globe.

Figure 15-5 shows a map of all 50 states made using an equal-area Albers projection.
This projection provides a reasonable representation of the relative shapes, areas, and 
locations of the 50 states, but we notice some issues. First, Alaska seems weirdly 
stretched out compared to how it looks, for example, in Figures 15-2 or 15-4. Second, 
the map is dominated by ocean/empty space. It would be preferable to zoom in further,
so that the lower 48 states take up a larger proportion of the map area.

Figure 15-5. Map of the United States of America, using an area-preserving Albers projection
(ESRI:102003, commonly used to project the lower 48 states). Alaska and Hawaii
are shown in their true locations.



To address the problem of uninteresting empty space, it is common practice to 
project Alaska and Hawaii separately (to minimize shape distortions) and then move 
them so they are shown underneath the lower 48 (Figure 15-6). You may notice in 
Figure 15-6 that Alaska looks much smaller relative to the lower 48 than it does in 
Figure 15-5. The reason for this discrepancy is that Alaska has not only been moved, 
it also has been scaled so it looks comparable in size to typical midwestern or western 
states. This scaling, while common practice, is misleading, and therefore I have 
labeled the figure as “bad.” 





Figure 15-6. Visualization of the United States, with the states of Alaska and Hawaii 
moved to lie underneath the lower 48 states. Alaska also has been scaled so its linear 
extent is only 35% of the state’s true size. (In other words, the state’s area has been 
reduced to approximately 12% of its true size.) Such a scaling is frequently applied to 
Alaska, to make it visually appear to be of similar size as typical midwestern or western 
states. However, the scaling is misleading, and therefore the figure has been labeled as 
“bad.” 
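
The relationship between the 35% linear scaling and the roughly 12% area quoted in the caption follows from the fact that area scales with the square of linear extent. A minimal sketch, with hypothetical function names:

```python
def area_fraction(linear_scale: float) -> float:
    """Fraction of the original area that remains after uniform linear scaling."""
    return linear_scale ** 2


def linear_scale_for_area(area_frac: float) -> float:
    """Linear scale factor needed to shrink a shape to a given fraction of its area."""
    return area_frac ** 0.5


# Scaling linear extent to 35% leaves about 12% of the area.
print(f"area fraction at 35% linear scale: {area_fraction(0.35):.4f}")
# Conversely, halving the drawn area requires a linear scale of about 71%.
print(f"linear scale for 50% area: {linear_scale_for_area(0.5):.4f}")
```

This quadratic relationship is why a seemingly moderate linear rescaling changes the visual weight of a region so strongly.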

Instead of both moving and scaling Alaska, we could just move it without changing 
its scale (Figure 15-7). This visualization reveals that Alaska is the largest state, over 
twice the size of Texas. We are not used to seeing the US shown in this way, but in my 
mind it is a much more reasonable representation of the 50 states than is Figure 15-6.

Figure 15-7. Visualization of the United States, with the states of Alaska and Hawaii
moved to lie underneath the lower 48 states. 

Layers 

To visualize geospatial data in the proper context, we usually create maps consisting 
of multiple layers showing different types of information. To demonstrate this concept,
I will visualize the locations of wind turbines in the San Francisco Bay Area. In
the Bay Area, wind turbines are clustered in two locations. One location, which I will
refer to as the Shiloh Wind Farm, lies near Rio Vista and the other lies east of Hayward
near Tracy (Figure 15-8).

Figure 15-8 consists of four separate layers. At the bottom, we have the terrain layer, 
which shows hills, valleys, and water. The next layer shows the road network. On top 
of the road layer, I have placed a layer indicating the locations of individual wind turbines.
This layer also contains the two rectangles highlighting the majority of the
wind turbines. Finally, the top layer adds the locations and names of cities. These four 
layers are shown separately in Figure 15-9. For any given map we want to make, we 
may want to add or remove some of these layers. For example, if we wanted to draw a 
map of voting districts, we might consider terrain information to be irrelevant and 
distracting. Alternatively, if we wanted to draw a map of exposed or covered roof 
areas to assess potential for solar power generation, we might want to replace terrain 
information with satellite imagery that shows individual roofs and actual vegetation. 
You can interactively try these different types of layers in most online map
applications, such as Google Maps. I would like to emphasize that regardless of which
layers you decide to keep or remove, it is generally recommended to add a scale bar
and a north arrow. The scale bar helps readers understand the size of the spatial features
shown in the map, while the north arrow clarifies the map’s orientation.

Figure 15-8. Wind turbines in the San Francisco Bay Area. Individual wind turbines are
shown as purple-colored dots. Two regions with a high concentration of wind turbines 
are highlighted with black rectangles. I refer to the wind turbines near Rio Vista collectively
as the Shiloh Wind Farm. Map tiles by Stamen Design, under CC BY 3.0. Map
data by OpenStreetMap, under ODbL. Wind turbine data source: US Wind Turbine 
Database.

Figure 15-9. The individual layers of Figure 15-8. From bottom to top, the figure consists
of a terrain layer, a roads layer, a layer showing the wind turbines, and a layer labeling 
cities and adding a scale bar and north arrow. Map tiles by Stamen Design, under CC 
BY 3.0. Map data by OpenStreetMap, under ODbL. Wind turbine data source: US Wind 
Turbine Database. 

All the concepts discussed in Chapter 2 of mapping data onto aesthetics carry over to 
maps. We can place data points into their geographic context and show other data 
dimensions via aesthetics such as color or shape. For example, Figure 15-10 provides 
a zoomed-in view of the rectangle labeled “Shiloh Wind Farm” in Figure 15-8. Individual
wind turbines are shown as dots, with the color representing when a specific
turbine was built and the shape representing the project to which the wind turbine 
belongs. A map such as this one can provide a quick overview of how an area was 
developed. For example, here we see that EDF Renewables is a relatively small project 
built before 2000, High Winds is a moderately sized project built between 2000 and 
2004, and Shiloh and Solano are the largest two projects in the area, both built over 
an extended period of time.

Figure 15-10. Locations of individual wind turbines in the Shiloh Wind Farm. Each dot
highlights the location of one wind turbine. The map area corresponds to the top rectangle
in Figure 15-8. Dots are colored by when the wind turbine was built, and the dot’s
shape represents the project to which an individual wind turbine belongs. Map tiles by 
Stamen Design, under CC BY 3.0. Map data by OpenStreetMap, under ODbL. Wind 
turbine data source: US Wind Turbine Database. 

Choropleth Mapping 

We frequently want to show how some quantity varies across locations. We can do so 
by coloring individual regions in a map according to the data dimension we want to 
display. Such maps are called choropleth maps. 

As a simple example, consider the population density (persons per square kilometer) 
across the United States. We take the population number for each county in the US, 
divide it by the county’s surface area, and then draw a map where the color of each 
county corresponds to the ratio between population number and area (Figure 15-11). 
We can see that the major cities on the East and West Coasts are the most densely populated
areas of the US, the Great Plains and western states have low population densities,
and the state of Alaska is the least densely populated of all.

Figure 15-11. Population density in every US county, shown as a choropleth map. Population
density is reported as persons per square kilometer. Data source: 2015 Five-Year
American Community Survey. 

Figure 15-11 uses light colors to represent low population densities and dark colors to 
represent high densities, so that high-density metropolitan areas stand out as dark 
colors on a background of light colors. We tend to associate darker colors with higher 
intensities when the background color of the figure is light. However, we can also pick 
a color scale where high values light up on a dark background (Figure 15-12). As long 
as the lighter colors fall into the red-yellow spectrum, so that they appear to be glowing,
they can be perceived as representing higher intensities. As a general principle,
when figures are meant to be printed on white paper, light-colored background areas 
(as in Figure 15-11) will typically work better. For online viewing or on a dark background,
dark-colored background areas (as in Figure 15-12) may be preferable.

Figure 15-12. Population density in every US county, shown as a choropleth map. This
map is identical to Figure 15-11 except that now the color scale uses light colors for high 
population densities and dark colors for low population densities. Data source: 2015 
Five-Year American Community Survey. 

Choropleths work best when the coloring represents a density (i.e., some quantity 
divided by surface area, as in Figures 15-11 and 15-12). We perceive larger areas as 
corresponding to larger amounts than smaller areas (see also Chapter 17), and shading
by density corrects for this effect. However, in practice, we often see choropleths
colored according to some quantity that is not a density. For example, in Figure 4-4 I 
showed a choropleth of median annual income in Texas counties. Such choropleth 
maps can be appropriate when they are prepared with caution. There are two conditions
under which we can color-map quantities that are not densities. First, if all the
individual areas we color have approximately the same size and shape, then we don’t 
have to worry about some areas drawing disproportionate attention solely due to 
their size. Second, if the individual areas we color are relatively small compared to the 
overall size of the map and if the quantity that color represents changes on a scale 
larger than the individual colored areas, then again we don’t have to worry about 
some areas drawing disproportionate attention solely due to their size. Both of these 
conditions are approximately met in Figure 4-4. 

It is also important to consider the effect of continuous versus discrete color scales in 
choropleth mapping. While continuous color scales tend to look visually appealing 
(e.g., Figures 15-11 and 15-12), they can be difficult to read. We are not very good at
recognizing a specific color value and matching it against a continuous scale. Therefore,
it is often appropriate to bin the data values into discrete groups that are represented
with distinct colors. Four to six bins is usually a good choice. The
binning sacrifices some information, but on the flip side the binned colors can be 
uniquely recognized. As an example, Figure 15-13 expands the map of median 
income in Texas counties (Figure 4-4) to all counties in the US, and it uses a color 
scale consisting of five distinct income bins. 
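
Binning a continuous quantity into a handful of discrete groups is straightforward to sketch. In the example below, the four bin edges and the labels are hypothetical, since the text does not state the breakpoints used in the figure. The helper name `income_bin` is likewise made up.

```python
from bisect import bisect_right

# Hypothetical bin edges in US dollars (an assumption, not the figure's actual breakpoints).
# Four edges define five bins.
BIN_EDGES = [30_000, 45_000, 60_000, 75_000]
BIN_LABELS = ["< $30k", "$30k to $45k", "$45k to $60k", "$60k to $75k", ">= $75k"]


def income_bin(median_income: float) -> int:
    """Return the index (0 to 4) of the bin that contains the given income.

    A value exactly equal to an edge is assigned to the bin above that edge.
    """
    return bisect_right(BIN_EDGES, median_income)


# Made-up county values, used only to demonstrate the binning.
for income in (28_500, 52_000, 61_250, 90_000):
    print(f"${income:,} falls into bin '{BIN_LABELS[income_bin(income)]}'")
```

Each bin index would then be mapped to one of five clearly distinguishable colors. The precise income within a bin is lost, which is the information sacrifice mentioned above.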



Figure 15-13. Median income in every US county, shown as a choropleth map. The 
median income values have been binned into five distinct groups, because binned color 
scales are generally easier to read than continuous color scales. Data source: 2015 Five- 
Year American Community Survey. 

Even though counties are not quite as equal-sized and even-shaped across the entire 
US as they are just within Texas, I think Figure 15-13 still works as a choropleth map. 
No individual county overly dominates the map. However, things look different when 
we draw a comparable map at the state level (Figure 15-14). Then Alaska dominates 
the choropleth and, because of its size, suggests that median incomes above $70,000 
are common. Yet Alaska is very sparsely populated (see Figures 15-11 and 15-12), and 
thus the income levels in Alaska apply only to a small portion of the US population. 
The vast majority of US counties, which are nearly all more populous than counties 
in Alaska, have a median income of below $60,000.

Figure 15-14. Median income in every US state, shown as a choropleth map. This map is
visually dominated by the state of Alaska, which has a high median income but very low 
population density. At the same time, the densely populated high-income states on the 
East Coast do not appear very prominent on this map. In aggregate, this map provides a 
poor visualization of the income distribution in the US, and therefore I have labeled it as 
“bad.” Data source: 2015 Five-Year American Community Survey. 


Cartograms 

Not every map-like visualization has to be geographically accurate to be useful. For 
example, the problem with Figure 15-14 is that some states take up a comparatively 
large area but are sparsely populated, while others take up a small area yet have a 
large number of inhabitants. What if we deformed the states so their size was proportional
to their number of inhabitants? Such a modified map is called a cartogram, and
Figure 15-15 shows what it can look like for the median income dataset. We can still 
recognize individual states, yet we also see how the adjustment for population numbers
has introduced important modifications. The East Coast states, Florida, and California
have grown a lot in size, whereas the other western states and Alaska have
collapsed.

Figure 15-15. Median income in every US state, shown as a cartogram. The shapes of
individual states have been modified such that their area is proportional to their number 
of inhabitants. Data source: 2015 Five-Year American Community Survey. 

As an alternative to a cartogram with distorted shapes, we can also draw a much simpler
cartogram heatmap, where each state is represented by a colored square
(Figure 15-16). While this representation does not correct for the population number 
in each state, and thus underrepresents more populous states and overrepresents less 
populous states, at least it treats all states equally and doesn’t weight them arbitrarily 
by their shape or size. 

Finally, we can draw more complex cartograms by placing individual plots at the 
location of each state. For example, if we want to visualize the evolution of the unemployment
rate over time for each state, it can help to draw an individual graph for
each state and then arrange the graphs based on the approximate relative positions of 
the states to each other (Figure 15-17). For somebody who is familiar with the geography
of the United States, this arrangement may make it easier to find the graphs for
specific states than arranging them, for example, in alphabetical order. Furthermore, 
one would expect neighboring states to display similar patterns, and Figure 15-17 
shows that this is indeed the case.

Figure 15-16. Median income in every US state, shown as a cartogram heatmap. Each
state is represented by an equally sized square, and the squares are arranged according to 
the approximate position of each state relative to the other states. This representation 
gives the same visual weight to each state. Data source: 2015 Five-Year American Community
Survey.

Figure 15-17. Unemployment rate leading up to and following the 2008 financial crisis,
by state. Each panel shows the unemployment rate for one state, including the District of
Columbia (DC), from January 2007 through May 2013. Vertical grid lines mark January
of 2008, 2010, and 2012. States that are geographically close tend to show similar trends
in the unemployment rate. Data source: US Bureau of Labor Statistics.

CHAPTER 16


Visualizing Uncertainty 


One of the most challenging aspects of data visualization is the visualization of uncertainty.
When we see a data point drawn in a specific location, we tend to interpret it as
a precise representation of the true data value. It is difficult to conceive that a data
point could actually lie somewhere it hasn’t been drawn. Yet this scenario is ubiquitous
in data visualization. Nearly every dataset we work with has some uncertainty,
and whether and how we choose to represent this uncertainty can make a major difference
in how accurately our audience perceives the meaning of the data.

Two commonly used approaches to indicate uncertainty are error bars and confidence
bands. These approaches were developed in the context of scientific publications,
and they require some amount of expert knowledge to be interpreted correctly,
yet they are precise and space-efficient. By using error bars, for example, we can show
the uncertainties of many different parameter estimates in a single graph. For a lay
audience, however, visualization strategies that create a strong intuitive impression of
the uncertainty will be preferable, even if they come at the cost of either reduced visualization
accuracy or less data-dense displays. Options here include frequency framing,
where we explicitly draw different possible scenarios in approximate proportions,
or animations that cycle through different possible scenarios. 

Framing Probabilities as Frequencies 

Before we can discuss how to visualize uncertainty, we need to define what it actually 
is. We can intuitively grasp the concept of uncertainty most easily in the context of 
future events. If I am going to flip a coin, I don’t know ahead of time what the outcome
will be. The eventual outcome is uncertain. I can also be uncertain about events
in the past, however. If yesterday I looked out of my kitchen window exactly twice,
once at 8 a.m. and once at 4 p.m., and I saw a red car parked across the street at 8 a.m.
but not at 4 p.m., then I can conclude the car left at some point during the eight-hour
window, but I don’t know exactly when. It could have been 8:01 a.m., 9:30 a.m., 2
p.m., or any other time during those eight hours. 

Mathematically, we deal with uncertainty by employing the concept of probability. A 
precise definition of probability is complicated and far beyond the scope of this book. 
Yet we can successfully reason about probabilities without understanding all the 
mathematical intricacies. For many problems of practical relevance it is sufficient to 
think about relative frequencies. Assume you perform some sort of random trial, 
such as a coin flip or rolling a die, and look for a particular outcome (e.g., heads or 
rolling a six). You can call this outcome success, and any other outcome failure. Then, 
the probability of success is approximately given by the fraction of times you’d see 
that outcome if you repeated the random trial over and over again. For instance, if a 
particular outcome occurs with a probability of 10%, then we expect that among 
many repeated trials that outcome will be seen in approximately 1 out of 10 cases. 
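
The relative-frequency view of probability is easy to explore by simulation. The following sketch uses Python's pseudorandom number generator as a stand-in for a real random trial. The helper name, the seed, and the trial counts are arbitrary choices for illustration.

```python
import random


def relative_frequency(n_trials: int, p_success: float, rng: random.Random) -> float:
    """Repeat a random trial n_trials times and return the fraction of successes."""
    successes = sum(rng.random() < p_success for _ in range(n_trials))
    return successes / n_trials


rng = random.Random(42)  # fixed seed so that repeated runs give the same numbers

# With few trials the observed fraction can be far from 10%;
# with many trials it settles close to the underlying probability.
for n in (10, 100, 1_000, 100_000):
    print(f"{n:7d} trials: observed frequency {relative_frequency(n, 0.10, rng):.4f}")
```

Note that any single short run can deviate noticeably from 1 in 10, which is exactly the unpredictability that makes a single probability value hard to interpret.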

Visualizing a single probability is difficult. How would you visualize the chance of 
winning in the lottery, or the chance of rolling a six with a fair die? In both cases, the 
probability is a single number. We could treat that number as an amount and display 
it using any of the techniques discussed in Chapter 6, such as a bar graph or a dot 
plot, but the result would not be very useful. Most people lack an intuitive understanding
of how a probability value translates into experienced reality. Showing the
probability value as a bar or as a dot placed on a line does not help with this problem. 

We can make the concept of probability tangible by creating a graph that emphasizes 
both the frequency aspect and the unpredictability of a random trial, for example by 
drawing squares of different colors in a random arrangement. In Figure 16-1, I use 
this technique to visualize three different probabilities, a 1% chance of success, a 10% 
chance of success, and a 40% chance of success. To read this figure, imagine you are 
given the task of picking a dark square by choosing a square before you can see which 
of the squares will be dark and which ones will be light. (If you will, you can think of 
picking a square with your eyes closed.) Intuitively, you will probably understand that 
you would be unlikely to select the one dark square in the 1% chance case. Similarly, 
it would still be fairly unlikely for you to select a dark square in the 10% chance case. 
However, in the 40% chance case the odds don’t look so bad. This style of visualization,
where we show specific potential outcomes, is called a discrete outcome visualization,
and the act of visualizing a probability as a frequency is called frequency
framing. We are framing the probabilistic nature of a result in terms of easily understood
frequencies of outcomes.

Figure 16-1. Visualizing probability as frequency. There are 100 squares in each grid,
and each square represents either success or failure in some random trial. A 1% chance of
success corresponds to 1 dark and 99 light squares, a 10% chance of success corresponds 
to 10 dark and 90 light squares, and a 40% chance of success corresponds to 40 dark and 
60 light squares. By randomly placing the dark squares among the light squares, we can 
create a visual impression of randomness that emphasizes the uncertainty of the outcome
of a single trial.
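
A grid of this kind can be generated by shuffling a fixed number of dark cells among the light ones. The sketch below renders the grid as text rather than as a figure. The function names and the text rendering are illustrative assumptions, not the method used to produce Figure 16-1.

```python
import random


def frequency_grid(n_dark: int, rows: int = 10, cols: int = 10, seed=None):
    """Return a rows x cols grid of booleans with exactly n_dark True cells, randomly placed."""
    total = rows * cols
    if not 0 <= n_dark <= total:
        raise ValueError("n_dark must lie between 0 and rows * cols")
    cells = [True] * n_dark + [False] * (total - n_dark)
    random.Random(seed).shuffle(cells)  # random placement creates the impression of chance
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]


def render(grid) -> str:
    """Draw dark squares as '#' and light squares as '.'."""
    return "\n".join("".join("#" if dark else "." for dark in row) for row in grid)


# The three probabilities discussed in the text: 1%, 10%, and 40% chance of success.
for n_dark in (1, 10, 40):
    print(f"{n_dark}% chance of success")
    print(render(frequency_grid(n_dark, seed=n_dark)))
    print()
```

Because the number of dark cells is fixed rather than drawn at random, the grid shows the probability exactly while the placement still looks unpredictable.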

If we are only interested in two discrete outcomes, success or failure, then a visualization
such as Figure 16-1 works fine. However, often we are dealing with more complex
scenarios where the outcome of a random trial is a numeric variable. One
common scenario is that of election predictions, where we are interested not only in 
who will win but also by how much. Let’s consider a hypothetical example of an 
upcoming election with two parties, the yellow party and the blue party. Assume you 
hear on the radio that the blue party is predicted to have a 1 percentage point advantage
over the yellow party, with a margin of error of 1.76 percentage points. What
does this information tell you about the likely outcome of the election? It is human
nature to hear “the blue party will win,” but reality is more complicated. First, and
most importantly, there is a range of different possible outcomes. The blue party
could end up winning with a lead of two percentage points, or the yellow party could
end up winning with a lead of half a percentage point. The range of possible outcomes
with their associated likelihoods is called a probability distribution, and we can
draw it as a smooth curve that rises and then falls over the range of possible outcomes
(Figure 16-2). The higher the curve for a specific outcome, the more likely that outcome
is. Probability distributions are closely related to the histograms and kernel
densities discussed in Chapter 7, and you may want to re-read that chapter to refresh 
your memory.

Figure 16-2. Hypothetical prediction of an election outcome. The blue party is predicted
to win over the yellow party by approximately 1 percentage point (labeled “best estimate”),
but that prediction has a margin of error (here drawn so it covers 95% of the
likely outcomes, 1.76 percentage points in either direction from the best estimate). The 
area shaded in blue, corresponding to 87.1% of the total, represents all outcomes under 
which blue would win. Likewise, the area shaded in yellow, corresponding to 12.9% of 
the total, represents all outcomes under which yellow would win. In this example, blue 
has an 87% chance of winning the election. 

By doing some math, we can calculate that for our made-up example, the chance of 
the yellow party winning is 12.9%. So, the chance of yellow winning is a tad better 
than the 10% chance scenario shown in Figure 16-1. If you favor the blue party, you 
may not be overly worried, but the yellow party has enough of a chance of winning 
that it might just be successful. If you compare Figure 16-2 to Figure 16-1, you may 
find that Figure 16-1 creates a much better sense of the uncertainty in outcome, even 
though the shaded areas in Figure 16-2 accurately represent the probabilities of blue 
or yellow winning. This is the power of a discrete outcome visualization. Research in 
human perception shows that we are much better at perceiving, counting, and judging
the relative frequencies of discrete objects (as long as their total number is not
too large) than we are at judging the relative sizes of different areas.
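
The calculation behind the winning probabilities can be sketched as follows. This sketch assumes that the outcome follows a normal distribution and that the margin of error marks a 95% interval around the best estimate; the text does not spell out the distribution, so both are assumptions. It also uses the rounded inputs quoted in the text (a lead of 1 point, a margin of 1.76 points).

```python
from statistics import NormalDist

best_estimate = 1.0     # predicted lead of blue, in percentage points (rounded value from the text)
margin_of_error = 1.76  # half-width of the assumed 95% interval, in percentage points

# For a normal distribution, a 95% interval spans about 1.96 standard deviations each way.
z_95 = NormalDist().inv_cdf(0.975)
sigma = margin_of_error / z_95

outcome = NormalDist(mu=best_estimate, sigma=sigma)

# Yellow wins whenever blue's lead is negative.
p_yellow = outcome.cdf(0.0)
p_blue = 1.0 - p_yellow

print(f"standard deviation: {sigma:.3f} percentage points")
print(f"chance of yellow winning: {p_yellow:.1%}")
print(f"chance of blue winning:   {p_blue:.1%}")
```

With these rounded inputs the result comes out at roughly 13% for yellow, in the same range as the 12.9% quoted above; a small difference is to be expected if the original calculation used unrounded values.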

We can combine the discrete outcome nature of Figure 16-1 with a continuous distribution
as in Figure 16-2 by drawing a quantile dot plot [Kay et al. 2016]. In the quantile
dot plot, we subdivide the total area under the curve into evenly sized units and
draw each unit as a circle. We then stack the circles such that their arrangement 
approximately represents the original distribution curve (Figure 16-3). 
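
The subdivision step can be sketched as follows. The sketch assumes a normal forecast distribution with hypothetical parameters (1.02 and 0.9, chosen to be consistent with the rounded numbers given for Figure 16-2), and it places each dot at the midpoint quantile of its probability slice, which is one common convention and not necessarily the one used for the figures. The stacking of circles is a drawing step and is left out.

```python
from statistics import NormalDist

def quantile_dots(dist, n_dots):
    """Return n_dots values, each standing for an equal 1/n_dots share of probability.

    Each dot sits at the midpoint quantile (k + 0.5) / n_dots of its slice.
    This placement rule is an assumption made for illustration.
    """
    return [dist.inv_cdf((k + 0.5) / n_dots) for k in range(n_dots)]

# Assumed illustrative forecast (hypothetical parameters, see lead-in).
forecast = NormalDist(mu=1.02, sigma=0.9)

for n_dots in (50, 10):
    dots = quantile_dots(forecast, n_dots)
    yellow = sum(1 for x in dots if x < 0)   # outcomes in which yellow wins
    print(f"{n_dots} dots: {yellow} yellow, each worth {100 / n_dots:.0f}%")
```

Under these assumptions, counting the dots that fall below zero gives the number of yellow dots, and multiplying by the share per dot gives the approximate chance of yellow winning.
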
Figure 16-3. Quantile dot plot representations of the election outcome distribution of 
Figure 16-2. (a) The smooth distribution is approximated with 50 dots representing a 2% 
chance each. The 6 yellow dots thus correspond to a 12% chance, reasonably close to the 
true value of 12.9%. (b) The smooth distribution is approximated with 10 dots 
representing a 10% chance each. The 1 yellow dot thus corresponds to a 10% chance, still close to 
the true value. Quantile dot plots with a smaller number of dots tend to be easier to 
read, so in this example, the 10-dot version might be preferable to the 50-dot version. 

As a general principle, quantile dot plots should use a small to moderate number of 
dots. If there are too many dots, then we tend to perceive them as a continuum rather 
than as individual, discrete units. This negates the advantages of the discrete plots. 
Figure 16-3 shows variants with 50 dots (Figure 16-3a) and with 10 dots 
(Figure 16-3b). While the version with 50 dots more accurately captures the true 
probability distribution, the number of dots is too large to easily discriminate 
individual ones. The version with 10 dots more immediately conveys the relative chances of 
blue or yellow winning. One objection to the 10-dot version might be that it is not 
very precise. We are underrepresenting the chance of yellow winning by 2.9 
percentage points. However, it is often worthwhile to trade some mathematical precision for 
more accurate human perception of the resulting visualization, in particular when 
communicating to a lay audience. A visualization that is mathematically correct but 
not properly perceived is not that useful in practice. 
Visualizing the Uncertainty of Point Estimates 

In Figure 16-2, I showed a “best estimate” and a “margin of error,” but I didn’t explain 
what exactly these quantities are or how they might be obtained. To understand them 
better, we need to take a quick detour into basic concepts of statistical sampling. In 
statistics, our overarching goal is to learn something about the world by looking at a 
small portion of it. To continue with the election example, assume there are many 
different electoral districts and the citizens of each district are going to vote for either 
the blue or the yellow party. We might want to predict how each district is going to 
vote, as well as the overall vote average across districts (the mean). To make a 
prediction before the election, we cannot poll each individual citizen in each district about 
how they are going to vote. Instead, we have to poll a subset of citizens in a subset of 
districts and use that data to arrive at a best guess. In statistical language, the total set 
of possible votes of all citizens in all districts is called the population, and the subset of 
citizens and/or districts we poll is the sample. The population represents the 
underlying true state of the world and the sample is our window into that world. 

We are normally interested in specific quantities that summarize important 
properties of the population. In the election example, these could be the mean vote outcome 
across districts or the standard deviation among district outcomes. Quantities that 
describe the population are called parameters, and they are generally not knowable. 
However, we can use a sample to make a guess about the true parameter values, and 
statisticians refer to such guesses as estimates. The sample mean (or average) is an 
estimate for the population mean, which is a parameter. The estimates of individual 
parameter values are also called point estimates, since each can be represented by a 
point on a line. 

Figure 16-4 shows how these key concepts are related to each other. The variable of 
interest (e.g., vote outcome in each district) has some distribution in the population, 
with a population mean and a population standard deviation. A sample will consist of 
a set of specific observations. The number of individual observations in the sample is 
called the sample size. From the sample we can calculate a sample mean and a sample 
standard deviation, and these will generally differ from the population mean and 
standard deviation. Finally, we can define a sampling distribution, which is the 
distribution of estimates we would obtain if we repeated the sampling process many times. 
The width of the sampling distribution is called the standard error, and it tells us how 
precise our estimates are. In other words, the standard error provides a measure of 
the uncertainty associated with our parameter estimate. As a general rule, the larger 
the sample size, the smaller the standard error and thus the less uncertain the 
estimate. 
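
A small simulation can make the sampling distribution concrete. The sketch below uses a made-up population of district-level vote margins (all numbers are hypothetical), repeats the sampling process many times, and measures the spread of the resulting sample means. It then compares that spread with the usual approximation of population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

def simulate_standard_error(population, sample_size, n_repeats, rng):
    """Spread (standard deviation) of sample means over repeated sampling."""
    means = [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_repeats)
    ]
    return statistics.stdev(means)

rng = random.Random(1)

# Made-up population: vote margins (percentage points) in 20,000 districts.
population = [rng.gauss(1.0, 10.0) for _ in range(20_000)]
pop_sd = statistics.pstdev(population)

for n in (25, 100, 400):
    simulated = simulate_standard_error(population, n, 2_000, rng)
    predicted = pop_sd / math.sqrt(n)
    print(f"n={n:4d}  simulated SE={simulated:.3f}  sd/sqrt(n)={predicted:.3f}")
```

The simulated spread shrinks as the sample size grows, which is the general rule stated above.
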
Figure 16-4. Key concepts of statistical sampling. The variable of interest that we are 
studying has some true distribution in the population, with a true population mean and 
standard deviation. Any finite sample of that variable will have a sample mean and 
standard deviation that differ from the population parameters. If we sampled repeatedly 
and calculated a mean each time, then the resulting means would be distributed 
according to the sampling distribution of the mean. The standard error provides information 
about the width of the sampling distribution, which informs us about how precisely we 
are estimating the parameter of interest (here, the population mean). 

It is critical that we don’t confuse the standard deviation and the standard error. The 
standard deviation is a property of the population. It tells us how much spread there 
is among individual observations we could make. For example, if we consider the 
population of voting districts, the standard deviation tells us how much districts 
differ from one another. By contrast, the standard error tells us how precisely we have 
determined a parameter estimate. If we wanted to estimate the mean voting outcome 
over all districts, the standard error would tell us how accurate our estimate for the 
mean is. 

All statisticians use samples to calculate parameter estimates and their uncertainties. 
However, they are divided in how they approach these calculations, into Bayesians 
and frequentists. Bayesians assume that they have some prior knowledge about the 
world, and they use the sample to update this knowledge. By contrast, frequentists 
attempt to make precise statements about the world without having any prior 
knowledge in hand. Fortunately, when it comes to visualizing uncertainty, Bayesians and 
frequentists can generally employ the same types of strategies. Here, I will first discuss 
the frequentist approach and then describe a few specific issues unique to the 
Bayesian context. 

Frequentists most commonly visualize uncertainty with error bars. While error bars 
can be useful as a visualization of uncertainty, they are not without problems, as I 
already alluded to in Chapter 9 (see Figure 9-1). It is easy for readers to be confused 
about what an error bar represents. To highlight this problem, in Figure 16-5 I show 
five different uses of error bars for the same dataset. The dataset contains expert 
ratings of chocolate bars, rated on a scale from 1 to 5, for chocolate bars manufactured 
in a number of different countries. For Figure 16-5 I have extracted all ratings for 
chocolate bars manufactured in Canada. Underneath the sample, which is shown as a 
strip chart of jittered dots, we see the sample mean plus/minus the standard deviation 
of the sample, the sample mean plus/minus the standard error, and 80%, 95%, and 
99% confidence intervals. All five error bars are derived from the variation in the 
sample, and they are all mathematically related, but they have different meanings. 
They are also visually quite distinct. 

Figure 16-5. Relationship between sample, sample mean, standard deviation, standard 
error, and confidence intervals, in an example of chocolate bar ratings. The observations 
(shown as jittered green dots) that make up the sample represent expert ratings of 125 
chocolate bars from manufacturers in Canada, rated on a scale from 1 (unpleasant) to 5 
(elite). The large orange dot represents the mean of the ratings. Error bars indicate, from 
top to bottom, twice the standard deviation, twice the standard error (standard 
deviation of the mean), and 80%, 95%, and 99% confidence intervals of the mean. Data 
source: Brady Brelinski, Manhattan Chocolate Society. 
Whenever you visualize uncertainty with error bars, you must 
specify what quantity and/or confidence level the error bars 
represent. 


The standard error is approximately given by the sample standard deviation divided 
by the square root of the sample size, and confidence intervals are calculated by 
multiplying the standard error by small constant values. For example, a 95% 
confidence interval extends approximately two times the standard error in either direction 
from the mean. Therefore, larger samples tend to have smaller standard errors and 
narrower confidence intervals, even if their standard deviation is the same. We can see this 
effect when we compare ratings for chocolate bars from Canada to ones from 
Switzerland (Figure 16-6). The mean rating and sample standard deviation are 
comparable between Canadian and Swiss chocolate bars, but we have ratings for 125 Canadian 
bars and only 38 Swiss bars, and consequently the confidence intervals around the 
mean are much wider in the case of Swiss bars. 
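
The relationship between standard deviation, standard error, and confidence intervals can be written out as a short sketch. Only the two sample sizes (125 and 38) come from the text. The mean and standard deviation below are hypothetical placeholders, and the multipliers are taken from the normal distribution, which is an approximation (for small samples, a t distribution would give slightly wider intervals).

```python
import math
from statistics import NormalDist

def confidence_interval(mean, sd, n, level):
    """Normal-approximation confidence interval for a mean."""
    se = sd / math.sqrt(n)                       # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.28, 1.96, 2.58
    return mean - z * se, mean + z * se

# Hypothetical summary statistics, identical for both groups on purpose.
MEAN, SD = 3.3, 0.45

for country, n in (("Canada", 125), ("Switzerland", 38)):
    for level in (0.80, 0.95, 0.99):
        lo, hi = confidence_interval(MEAN, SD, n, level)
        print(f"{country:12s} n={n:3d}  {level:.0%} CI: [{lo:.3f}, {hi:.3f}]")
```

Because mean and standard deviation are held equal, any difference in interval width in the output is due to sample size alone. The three levels per group correspond to the graded error bars of Figure 16-6.
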
Figure 16-6. Confidence intervals widen with smaller sample size. Chocolate bars from 
Canada and Switzerland have comparable mean ratings and comparable standard 
deviations (indicated with simple black error bars). However, over three times as many 
Canadian bars were rated as Swiss bars, and therefore the confidence intervals 
(indicated with error bars of different colors and thickness drawn on top of one another) are 
substantially wider for the mean of the Swiss ratings than for the mean of the Canadian 
ratings. Data source: Brady Brelinski, Manhattan Chocolate Society. 

In Figure 16-6, I am showing three different confidence intervals at the same time, 
using darker colors and thicker lines for the intervals representing lower confidence 
levels. I refer to these visualizations as graded error bars. The grading helps the reader 
perceive that there is a range of different possibilities. If I showed simple error bars 
(without grading) to a group of people, chances are at least some of them would 
perceive the error bars deterministically, for example as representing the minimum and 
maximum of the data. Alternatively, they might think the error bars delineate the 
range of possible parameter estimates—i.e., that the estimate could never fall outside 
the error bars. These types of misperceptions are called deterministic construal errors. 
The more we can minimize the risk of deterministic construal error, the better our 
visualization of uncertainty. 

Error bars are convenient because they allow us to show many estimates with their 
uncertainties all at once. Therefore, they are commonly used in scientific 
publications, where the primary goal is usually to convey a large amount of information to 
an expert audience. As an example of this type of application, Figure 16-7 shows 
mean chocolate ratings and associated confidence intervals for chocolate bars 
manufactured in six different countries. 
Figure 16-7. Mean chocolate flavor ratings and associated confidence intervals for 
chocolate bars from manufacturers in six different countries. Data source: Brady Brelinski, 
Manhattan Chocolate Society. 

When looking at Figure 16-7, you may wonder what it tells us about the differences 
in mean ratings. The mean ratings of Canadian, Swiss, and Austrian bars are higher 
than the mean rating of US bars, but given the uncertainty in these mean ratings, are 
the differences in means significant? The word “significant” here is a technical term 
used by statisticians. We call a difference significant if with some level of confidence 
we can reject the assumption that the observed difference was caused by random 
sampling. Since only a finite number of Canadian and US bars were rated, the raters 
could have accidentally considered more of the better Canadian bars and fewer of the 
better US bars, and this random chance might look like a systematic rating advantage 
of Canadian over US bars. 

Assessing significance from Figure 16-7 is difficult, because both the mean Canadian 
rating and the mean US rating have uncertainty. Both uncertainties matter to the 
question of whether the means are different. Statistics textbooks and online tutorials 
sometimes publish rules of thumb for how to judge significance from the extent to 
which error bars do or don’t overlap. However, these rules of thumb are not reliable 
and should be avoided. The correct way to assess whether there are differences in 
mean rating is to calculate confidence intervals for the differences. If those 
confidence intervals exclude zero, then we know the difference is significant at the 
respective confidence level. For the chocolate ratings dataset, we see that only bars from 
Canada are significantly higher-rated than bars from the US (Figure 16-8). For bars 
from Switzerland, the 95% confidence interval on the difference just barely includes 
the value zero. Thus, there is a slightly bigger than 5% chance that the observed 
difference between mean ratings of US and Swiss chocolate bars represents nothing 
more than sampling variation. Finally, there is no evidence at all that Austrian bars 
have systematically higher mean ratings than US bars. 
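
A confidence interval for a difference in means can be sketched as follows. The summary statistics are hypothetical, the two groups are simply called A and B relative to a reference group, and the interval uses a normal approximation with unpooled variances. The sketch also makes no adjustment for multiple comparisons, so it is simpler than whatever procedure underlies the figure.

```python
import math
from statistics import NormalDist

def difference_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Normal-approximation confidence interval for mean_a - mean_b."""
    diff = mean_a - mean_b
    # Both uncertainties matter: the variances of the two means add up.
    se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se_diff, diff + z * se_diff

def excludes_zero(interval):
    """True if the whole interval lies on one side of zero."""
    lo, hi = interval
    return lo > 0 or hi < 0

# Hypothetical summary statistics: (mean, sd, n).
reference = (3.15, 0.45, 700)
groups = {"A": (3.32, 0.45, 125), "B": (3.21, 0.50, 4)}

for name, stats in groups.items():
    ci = difference_ci(*stats, *reference)
    verdict = "significant" if excludes_zero(ci) else "not significant"
    print(f"{name} minus reference: [{ci[0]:+.3f}, {ci[1]:+.3f}]  {verdict} at 95%")
```

The decision rule is the one described above: a difference counts as significant at the chosen level only if the interval for the difference excludes zero.
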
Figure 16-8. Mean chocolate flavor ratings for manufacturers from five different 
countries, relative to the mean rating of US chocolate bars. Canadian chocolate bars are rated 
significantly higher than US bars. For the other four countries there is no significant 
difference in mean rating compared to the US at the 95% confidence level. Confidence 
levels have been adjusted for multiple comparisons using Dunnett’s method. Data 
source: Brady Brelinski, Manhattan Chocolate Society. 


In the preceding figures, I have used two different types of error bars, graded and 
simple. More variations are possible. For example, we can draw error bars with or 
without a cap at the end (Figure 16-9a,c versus Figure 16-9b,d). There are advantages 
and disadvantages to all these choices. Graded error bars highlight the existence of 
different ranges corresponding to different confidence levels. However, the flip side of 
this additional information is added visual noise. Depending on how complex and 
information-dense a figure is otherwise, simple error bars may be preferable to 
graded ones. Whether to draw error bars with or without a cap is primarily a question of 
personal taste. A cap highlights where exactly an error bar ends (Figure 16-9a,c), 
whereas an error bar without a cap puts equal emphasis on the entire range of the 
interval (Figure 16-9b,d). Also, again, caps add visual noise, so in a figure with many 
error bars omitting caps may be preferable. 
Figure 16-9. Mean chocolate flavor ratings for manufacturers from four different 
countries, relative to the mean rating of US chocolate bars. Each panel uses a different 
approach to visualizing the same uncertainty information: (a) graded error bars with 
caps; (b) graded error bars without caps; (c) single-interval error bars with caps; 
(d) single-interval error bars without caps; (e) confidence strips; (f) confidence 
distributions. Data source: Brady Brelinski, Manhattan Chocolate Society. 
As an alternative to error bars, we could draw confidence strips that gradually fade 
into nothing (Figure 16-9e). Confidence strips better convey how probable different 
values are, but they are difficult to read. We would have to visually integrate the 
different shadings of color to determine where a specific confidence level ends. From 
Figure 16-9e we might conclude that the mean rating for Peruvian chocolate bars is 
significantly lower than that of US chocolate bars, and yet this is not the case. Similar 
problems arise when we show explicit confidence distributions (Figure 16-9f). It is 
difficult to visually integrate the area under the curve and to determine where exactly 
a given confidence level is reached. This issue can be somewhat alleviated, however, 
by drawing quantile dot plots as in Figure 16-3. 

For simple 2D figures, error bars have one important advantage over more complex 
displays of uncertainty: they can be combined with many other types of plots. For 
nearly any visualization we may have, we can add some indication of uncertainty by 
adding error bars. For example, we can show amounts with uncertainty by drawing a 
bar plot with error bars (Figure 16-10). This type of visualization is commonly used 
in scientific publications. We can also draw error bars along both the x and the y 
direction in a scatterplot (Figure 16-11). 
Figure 16-10. Mean butterfat contents in the milk of four cattle breeds. Error bars 
indicate +/- one standard error of the mean. Visualizations of this type are frequently seen 
in the scientific literature. While they are technically correct, they represent neither the 
variation within each category nor the uncertainty of the sample means particularly 
well. See Figure 7-11 for the variation in butterfat contents within individual breeds. 
Data source: Canadian Record of Performance for Purebred Dairy Cattle. 
Figure 16-11. Median income versus median age for 67 counties in Pennsylvania. Error 
bars represent 90% confidence intervals. Data source: 2015 Five-Year American 
Community Survey. 

Let’s return to the topic of frequentists and Bayesians. Frequentists assess uncertainty 
with confidence intervals, whereas Bayesians calculate posterior distributions and 
credible intervals. The Bayesian posterior distribution tells us how likely specific 
parameter estimates are given the input data. The credible interval indicates a range 
of values in which the parameter value is expected with a given probability, as 
calculated from the posterior distribution. For example, a 95% credible interval corresponds 
to the center 95% of the posterior distribution. The true parameter value has a 95% 
chance of lying in the 95% credible interval. 
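
As a minimal sketch of how a credible interval follows from a posterior distribution, consider the simplest textbook case: a normal prior on the mean and normally distributed data with a known standard deviation. This conjugate model and all numbers in it are assumptions made for illustration. It is not the model behind the figures in this chapter.

```python
import math
from statistics import NormalDist

def posterior_for_mean(sample_mean, n, data_sd, prior_mean, prior_sd):
    """Posterior of a mean, given a normal prior and normal data with known sd."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / data_sd**2
    post_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior and data.
    post_mean = (prior_precision * prior_mean
                 + data_precision * sample_mean) / post_precision
    return NormalDist(mu=post_mean, sigma=math.sqrt(1.0 / post_precision))

def credible_interval(posterior, level=0.95):
    """Central interval containing the given share of the posterior."""
    tail = (1.0 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)

# Hypothetical numbers: 38 ratings with mean 3.40, prior centered at 3.20.
posterior = posterior_for_mean(sample_mean=3.40, n=38, data_sd=0.45,
                               prior_mean=3.20, prior_sd=0.20)
lo, hi = credible_interval(posterior)
print(f"posterior mean: {posterior.mean:.3f}")
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

In this model the posterior mean always lies between the prior mean and the sample mean, so the estimate is pulled slightly toward the prior.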

If you are not a statistician, you may be surprised by my definition of a credible 
interval. You may have thought that it was actually the definition of a confidence interval. 
It is not. A Bayesian credible interval tells you about where the true parameter likely 
is, and a frequentist confidence interval tells you about where the true parameter 
likely is not. While this distinction may seem like semantics, there are important 
conceptual differences between the two approaches. Under the Bayesian approach, you 
use the data and your prior knowledge about the system under study (called the 
prior) to calculate a probability distribution (the posterior) that tells you where you 
can expect the true parameter value to lie. By contrast, under the frequentist 
approach, you first make an assumption that you intend to disprove. This assumption 
is called the null hypothesis, and it is often simply the assumption that the parameter 
equals zero (e.g., there is no difference between two conditions). You then calculate 
the probability that random sampling would generate data similar to what was 
observed if the null hypothesis were true. The confidence interval is a representation 
of this probability. If a given confidence interval excludes the parameter value under 
the null hypothesis (i.e., the value zero), then you can reject the null hypothesis at that 
confidence level. Alternatively, you can think of a confidence interval as an interval 
that captures the true parameter value with the specified likelihood under repeated 
sampling (Figure 16-12). Thus, if the true parameter value were zero, a 95% 
confidence interval would only exclude zero in 5% of the samples analyzed. 
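
The repeated-sampling interpretation can be checked with a short simulation. The sketch draws many samples from a made-up normal population with a known mean, builds an interval of the form sample mean plus or minus a multiple of the standard error for each, and counts how often the interval contains the true mean. The population, sample size, and number of repeats are arbitrary choices, and the multipliers 1.0 and 1.96 are normal-approximation values.

```python
import math
import random
import statistics

def coverage(true_mean, true_sd, sample_size, n_repeats, z, rng):
    """Fraction of samples whose interval mean +/- z * SE contains true_mean."""
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(sample_size)
        if mean - z * se <= true_mean <= mean + z * se:
            hits += 1
    return hits / n_repeats

rng = random.Random(42)
frac_68 = coverage(0.0, 1.0, 30, 5_000, 1.0, rng)    # mean +/- one SE
frac_95 = coverage(0.0, 1.0, 30, 5_000, 1.96, rng)   # mean +/- 1.96 SE
print(f"mean +/- 1.00 SE covers the true mean in {frac_68:.1%} of samples")
print(f"mean +/- 1.96 SE covers the true mean in {frac_95:.1%} of samples")
```

Any single interval either contains the true mean or it does not. The stated confidence level only describes the long-run fraction of intervals that do.
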
Figure 16-12. Frequency interpretation of a confidence interval. Confidence intervals 
(CIs) are best understood in the context of repeated sampling. For each sample, a specific 
confidence interval either includes (green) or excludes (orange) the true parameter, here 
the mean. However, if we sample repeatedly, then the confidence intervals (shown here 
are 68% confidence intervals, corresponding to sample mean +/- standard error) include 
the true mean approximately 68% of the time. 

To summarize, a Bayesian credible interval makes a statement about the true 
parameter value, and a frequentist confidence interval makes a statement about the null 
hypothesis. In practice, however, Bayesian and frequentist estimates are often quite 
similar (Figure 16-13). One conceptual advantage of the Bayesian approach is that it 
emphasizes thinking about the magnitude of an effect, whereas frequentist 
thinking emphasizes a binary perspective of an effect either existing or not. 
Figure 16-13. Comparison of frequentist confidence intervals and Bayesian credible 
intervals for mean chocolate ratings. We see that the two approaches yield similar but 
not exactly identical results. In particular, the Bayesian estimates display a small 
amount of shrinkage, which is an adjustment of the most extreme parameter estimates 
toward the overall mean. (Note how the Bayesian estimate for Switzerland is slightly 
moved to the left and the Bayesian estimate for Peru is slightly moved to the right 
relative to the respective frequentist estimates.) The frequentist estimates and confidence 
intervals shown here are identical to the results for 95% confidence shown in 
Figure 16-7. Data source: Brady Brelinski, Manhattan Chocolate Society. 



A Bayesian credible interval answers the question, “Where do we 
expect the true parameter value to lie?” A frequentist confidence 
interval answers the question, “How certain are we that the true 
parameter value is not zero?” 


The central goal of Bayesian estimation is to obtain the posterior distribution. 
Therefore, Bayesians commonly visualize the entire distribution rather than simplifying it 
into a credible interval. So, in terms of data visualization, all the approaches to 
visualizing distributions discussed in Chapters 7, 8, and 9 are applicable. Specifically, 
histograms, density plots, boxplots, violins, and ridgeline plots are all commonly used to 
visualize Bayesian posterior distributions. Since these approaches have been 
discussed at length in their respective chapters, I will here show only one example, 
using a ridgeline plot to show Bayesian posterior distributions of mean chocolate 
ratings (Figure 16-14). In this specific case, I have added shading under the curve to 
indicate defined regions of posterior probabilities. As an alternative to shading, I 
could also have drawn quantile dot plots, or I could have added graded error bars 
underneath each distribution. Ridgeline plots with error bars underneath are called 
half-eyes, and violin plots with error bars are called eye plots (see “Uncertainty” on 
page 43). 
Figure 16-14. Bayesian posterior distributions of mean chocolate bar ratings, shown as a 
ridgeline plot. The red dots represent the medians of each posterior distribution. Because 
it is difficult to convert a continuous distribution into specific confidence regions by eye, 
I have added shading under each curve to indicate the center 80%, 95%, and 99% of 
each posterior distribution. Data source: Brady Brelinski, Manhattan Chocolate Society. 


Visualizing the Uncertainty of Curve Fits 

In Chapter 14, we discussed how to show a trend in a dataset by fitting a straight line 
or curve to the data. These trend estimates also have uncertainty, and it is customary 
to show the uncertainty in a trend line with a confidence band (Figure 16-15). The 
confidence band provides us with a range of different fit lines that would be 
compatible with the data. When students encounter a confidence band for the first time, they 
are often surprised that even a perfectly straight line fit produces a confidence band 
that is curved. The reason for the curvature is that the straight line fit can move in 
two distinct directions: it can move up and down (i.e., have different intercepts), and 
it can rotate (i.e., have different slopes). We can visually show how the confidence 
band arises by drawing a set of alternative fit lines randomly generated from the 
posterior distribution of the fit parameters. This is done in Figure 16-16, which shows 15 
randomly chosen alternative fits. We see that even though each line is perfectly 
straight, the combination of different slopes and intercepts of each line generates an 
overall shape that looks just like the confidence band. 
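
The way many straight lines add up to a curved band can be reproduced numerically. The figure draws its lines from the posterior distribution of the fit parameters. The sketch below swaps in a simpler technique, the bootstrap, which refits the line to resampled data points. The data are made up (they only loosely imitate body mass and head length), so the numbers themselves carry no meaning.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def bootstrap_fits(xs, ys, n_fits, rng):
    """Alternative fit lines obtained by refitting to resampled data points."""
    points = list(zip(xs, ys))
    fits = []
    while len(fits) < n_fits:
        resampled = rng.choices(points, k=len(points))
        rx, ry = zip(*resampled)
        if len(set(rx)) > 1:   # a line needs at least two distinct x values
            fits.append(fit_line(rx, ry))
    return fits

def band_width(fits, x):
    """Spread of the fitted lines at position x (std. dev. of predictions)."""
    return statistics.stdev([a + b * x for a, b in fits])

rng = random.Random(7)
xs = [rng.uniform(60, 85) for _ in range(60)]        # made-up x values
ys = [40 + 0.2 * x + rng.gauss(0, 1.0) for x in xs]  # made-up linear trend
fits = bootstrap_fits(xs, ys, 500, rng)

x_mid = statistics.fmean(xs)
for x in (min(xs), x_mid, max(xs)):
    print(f"x={x:5.1f}  spread of fitted lines={band_width(fits, x):.3f}")
```

The spread of the lines is smallest near the center of the data and grows toward both ends, because changes in slope matter more the farther a point lies from the center. This is what gives the band its curved outline.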



Figure 16-15. Head length versus body mass for male blue jays, as in Figure 14-7. The 
straight blue line represents the best linear fit to the data, and the gray band around the 
line shows the uncertainty in the linear fit. The gray band represents a 95% confidence 
level. Data source: Keith Tarvin, Oberlin College. 

Figure 16-16. Head length versus body mass for male blue jays. In contrast to 
Figure 16-15, the straight blue lines now represent equally likely alternative fits 
randomly drawn from the posterior distribution. Data source: Keith Tarvin, Oberlin 
College. 

To draw a confidence band, we need to specify a confidence level, and just as we saw 
for error bars and posterior probabilities, it can be useful to highlight different levels 
of confidence. This leads us to the graded confidence band, which shows several 
confidence levels at once (Figure 16-17). A graded confidence band enhances the sense of 
uncertainty in the reader, and it forces the reader to confront the possibility that the 
data might support different alternative trend lines. 
Figure 16-17. Head length versus body mass for male blue jays. As in the case of error 
bars, we can draw graded confidence bands to highlight the uncertainty in the estimate. 
Data source: Keith Tarvin, Oberlin College. 

We can also draw confidence bands for nonlinear curve fits. Such confidence bands 
look nice but can be difficult to interpret (Figure 16-18). If we look at Figure 16-18a, 
we may think that the confidence band arises by moving the blue line up and down 
and maybe deforming it slightly. However, as Figure 16-18b reveals, the confidence 
band represents a family of curves that are all quite a bit more wiggly than the overall 
best fit shown in part (a). This is a general principle of nonlinear curve fits. 
Uncertainty corresponds not just to a movement of the curve up and down but also to 
increased wiggliness. 
Figure 16-18. Fuel efficiency versus displacement, for 32 cars (1973-74 models). Each 
dot represents one car, and the smooth lines were obtained by fitting a cubic regression 
spline with 5 knots. (a) Best fit spline and confidence band. (b) Equally likely alternative 
fits drawn from the posterior distribution. Data source: Motor Trend, 1974. 


Hypothetical Outcome Plots 


All static visualizations of uncertainty suffer from the problem that viewers may 
interpret some aspect of the uncertainty visualization as a deterministic feature of the 
data (a deterministic construal error, as described previously). We can avoid this 
problem by visualizing uncertainty through animation, by cycling through a number 
of different but equally likely plots. This kind of visualization is called a hypothetical 
outcome plot (HOP) [Hullman, Resnick, and Adar 2015]. While HOPs are not 
possible in a print medium, they can be very effective in online settings where animated 
visualizations can be provided in the form of GIFs or MP4 videos. HOPs can also 
work well in the context of an oral presentation. 

To illustrate the concept of a HOP, let’s go back once more to chocolate bar ratings. 
When you are standing in the grocery store thinking about buying some chocolate, 
you probably don’t care about the mean flavor rating and associated uncertainty for 
certain groups of chocolate bars. Instead, you might want to know the answer to a 
simpler question, such as: if I randomly pick up a Canadian- and a US-manufactured 
chocolate bar, which one of the two should I expect to taste better? To arrive at an 
answer to this question, we could randomly select a Canadian and a US bar from the 
dataset, compare their ratings, record the outcome, and then repeat this process many 
times. If we did this, we would find that in approximately 53% of the cases the 
Canadian bar will be ranked higher, and in 47% of the cases either the US bar is ranked 
higher or the two bars are tied. We can show this process visually by cycling between 
several of these random draws and showing the relative ranking of the two bars for 
each draw (Figure 16-19). 
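
The random pairing procedure can be sketched in a few lines. The two lists of ratings below are made up for illustration and are not the chocolate dataset, so the resulting percentage will not match the 53% quoted above. Each simulated draw corresponds to one frame of the hypothetical outcome plot.

```python
import random

def exact_win_fraction(ratings_a, ratings_b):
    """Fraction of all (a, b) pairs in which a is rated strictly higher than b."""
    wins = sum(1 for a in ratings_a for b in ratings_b if a > b)
    return wins / (len(ratings_a) * len(ratings_b))

def draw_outcomes(ratings_a, ratings_b, n_draws, rng):
    """Pair one random bar from each group; True means the bar from A wins."""
    return [rng.choice(ratings_a) > rng.choice(ratings_b) for _ in range(n_draws)]

# Made-up ratings on a 1 to 5 scale (hypothetical, for illustration only).
canada = [2.75, 3.0, 3.25, 3.25, 3.5, 3.5, 3.75, 4.0]
us = [2.5, 2.75, 3.0, 3.25, 3.25, 3.5, 3.75, 3.75]

rng = random.Random(0)
outcomes = draw_outcomes(canada, us, 10_000, rng)
simulated = sum(outcomes) / len(outcomes)
exact = exact_win_fraction(canada, us)

print(f"Canadian bar rated higher: {simulated:.1%} of random draws")
print(f"exact share over all possible pairs: {exact:.1%}")
```

Ties count against the first group here, in line with the text, where the remaining cases are those in which the US bar is ranked higher or the two bars are tied.
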
Figure 16-19. Schematic of a hypothetical outcome plot for chocolate bar ratings of 
Canadian- and US-manufactured bars. Each vertical green bar represents the rating for 
one bar, and each panel shows a comparison of two randomly chosen bars, one each 
from a Canadian manufacturer and a US manufacturer. In an actual hypothetical 
outcome plot, the display would cycle between the distinct plot panels instead of showing 
them side-by-side. Data source: Brady Brelinski, Manhattan Chocolate Society. 

As a second example, consider the variation in shapes among equally probable trend 
lines in Figure 16-18b. Because all trend lines are plotted on top of one another, we 
primarily perceive the overall area that is covered by trend lines, which is similar to 
the confidence band. Perceiving individual trend lines is difficult. By turning this 
figure into a HOP, we can highlight individual trend lines one at a time (Figure 16-20). 

When preparing a HOP, you may wonder whether it is better to make a hard switch 
between different outcomes (as in a slide projector) or rather smoothly animate from 
one outcome to the next (e.g., slowly deform the trend line for one outcome until it 
looks like the trend line for another outcome). While this is to some extent an open 
question that continues to be researched, some evidence indicates that smooth transitions 
make it harder to judge the probabilities represented [Kale et al. 2018]. If 
you consider animating between outcomes, you may want to at least make these animations 
very fast, or choose an animation style where outcomes fade in and out 
rather than deform from one to the other. 

Figure 16-20. Schematic of a hypothetical outcome plot for fuel efficiency versus displacement. 
Each dot represents one car, and the smooth lines were obtained by fitting a 
cubic regression spline with 5 knots. Each line in each panel represents one alternative fit 
outcome, drawn from the posterior distribution of the fit parameters. In an actual hypothetical 
outcome plot, the display would cycle between the distinct plot panels instead of 
showing them side-by-side. Data source: Motor Trend, 1974. 

There is one critical aspect we need to pay attention to when preparing a HOP: we 
need to make sure that the outcomes we do show are representative of the true distribution 
of possible outcomes. Otherwise, our HOP could be rather misleading. For 
example, going back to the case of chocolate ratings, if I randomly selected 10 outcome 
pairs of chocolate bars and among those the US bar was rated higher than the 
Canadian bar in 7 cases, then the HOP would erroneously create the impression that 
US bars tend to be rated higher than Canadian bars. We can prevent this issue either 
by choosing a very large number of outcomes, so sampling biases are unlikely, or by 
verifying in some form that the outcomes that are shown are appropriate. When 
making Figure 16-19, I verified that the number of times the Canadian bar was shown 
winning was close to the true percentage of 53%. 
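
One way to carry out such a check, assuming we have already recorded a large pool of outcomes, is to redraw the set of frames until the share of wins shown is close to the share in the full pool. The helper `pick_representative_frames`, its tolerance, and the pool of 100 outcomes below are hypothetical; only the 53% target comes from the text.

```python
import random


def pick_representative_frames(outcomes, n_frames, target, tol=0.05,
                               max_tries=1000, seed=0):
    """Sample frames until the shown win fraction is within tol of target.

    outcomes: list of booleans, True meaning the first group was rated higher.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        frames = rng.sample(outcomes, n_frames)
        shown = sum(frames) / n_frames
        if abs(shown - target) <= tol:
            return frames
    raise RuntimeError("no representative set of frames found")


# Hypothetical pool: 53 wins for the first group out of 100 recorded outcomes.
pool = [True] * 53 + [False] * 47
frames = pick_representative_frames(pool, n_frames=20, target=0.53)
print(sum(frames) / len(frames))
```

Rejecting unrepresentative sets in this way is a deliberate intervention in the sampling, so it is worth stating in the figure caption or methods when it has been done.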

PART II 


Principles of Figure Design 




CHAPTER 17 


The Principle of Proportional Ink 


In many different visualization scenarios, we represent data values by the extent of a 
graphical element. For example, in a bar plot, we draw bars that begin at 0 and end at 
the data value they represent. In this case, the data value is not only encoded in the 
endpoint of the bar but also in the height or length of the bar. If we drew a bar that 
started at a different value than 0, then the length of the bar and the bar endpoint 
would convey contradicting information. Such figures are internally inconsistent, 
because they show two different values with the same graphical element. Contrast this 
to a scenario where we visualize the data value with a dot. In this case, the value is 
only encoded in the location of the dot, not in the size or shape of the dot. 

Similar issues will arise whenever we use graphical elements such as bars, rectangles, 
shaded areas of arbitrary shape, or any other elements that have a defined visual 
extent which can be either consistent or inconsistent with the data value shown. In all 
these cases, we need to make sure that there is no inconsistency. This concept has 
been termed the principle of proportional ink [Bergstrom and West 2016]: 

When a shaded region is used to represent a numerical value, the area of that shaded 
region should be directly proportional to the corresponding value. 

(It is common practice to use the word “ink” to refer to any part of a visualization that 
deviates from the background color. This includes lines, points, shaded areas, and 
text. In this chapter, however, we are talking primarily about shaded areas.) Violations 
of this principle are quite common, in particular in the popular press and in the 
world of finance. 

Visualizations Along Linear Axes 

We first consider the most common scenario, visualization of amounts along a linear 
scale. Figure 17-1 shows the median income in the five counties that make up the 
state of Hawaii. It is a typical figure one might encounter in a newspaper article. A 
quick glance at the figure suggests that the county of Hawaii is incredibly poor while 
the county of Honolulu is much richer than the other counties. However, Figure 17-1 
is quite misleading, because all the bars begin at $50,000 median income. Thus, while 
the endpoint of each bar correctly represents the actual median income in each 
county, the bar height represents the extent to which median incomes exceed $50,000, 
an arbitrary number. And human perception is such that the bar height is the key 
quantity we perceive when looking at this figure, not the location of the bar endpoint 
relative to the y axis. 

Figure 17-1. Median income in the five counties of the state of Hawaii. This figure is misleading, 
because the y-axis scale starts at $50,000 instead of $0. As a result, the bar 
heights are not proportional to the values shown, and the income differential between 
the county of Hawaii and the other four counties appears much bigger than it actually is. 
Data source: 2015 Five-Year American Community Survey. 

An appropriate visualization of this dataset makes for a less exciting story 
(Figure 17-2). While there are differences in median income between the counties, 
they are nowhere near as big as Figure 17-1 suggested. Overall, the median incomes 
in the different counties are somewhat comparable. 

Figure 17-2. Median income in the five counties of the state of Hawaii. Here, the y-axis 
scale starts at $0 and therefore the relative magnitudes of the median incomes in the five 
counties are accurately shown. Data source: 2015 Five-Year American Community 
Survey. 
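
The size of the distortion introduced by a truncated axis is easy to quantify. The sketch below compares the ratio of two drawn bar lengths with the ratio of the underlying values. The function name `drawn_ratio` and the two income figures are invented for illustration and are not the Hawaii data; only the $50,000 baseline is taken from Figure 17-1.

```python
def drawn_ratio(value_a, value_b, baseline=0.0):
    """Ratio of the bar lengths a reader sees when bars start at `baseline`."""
    return (value_a - baseline) / (value_b - baseline)


# Hypothetical median incomes in dollars (illustrative values only).
low, high = 52_000, 70_000

true_ratio = drawn_ratio(high, low)                       # bars start at $0
truncated_ratio = drawn_ratio(high, low, baseline=50_000)  # bars start at $50,000

print(f"bars from $0:      {true_ratio:.2f}x")
print(f"bars from $50,000: {truncated_ratio:.2f}x")
```

With these made-up numbers the taller bar is about a third longer when bars start at $0 but ten times longer when they start at $50,000, even though the endpoints are identical in both cases.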



Bars on a linear scale should always start at 0. 


Similar visualization problems frequently arise in the visualization of time series, such 
as those of stock prices. Figure 17-3 suggests a massive collapse in the stock price of 
Facebook occurred around Nov. 1, 2016. In reality, the price decline was moderate 
relative to the total price of the stock (Figure 17-4). The y-axis range in Figure 17-3 
would be questionable even without the shading underneath the curve. But with the 
shading, the figure becomes particularly problematic. The shading emphasizes the 
distance from the location of the x axis to the specific y values shown, and thus it creates 
the visual impression that the height of the shaded area at a given day represents 
the stock price of that day. Instead, it only represents the difference in the stock price 
from the baseline, which is $110 in Figure 17-3. 

Figure 17-3. Stock price of Facebook (FB) from Oct. 22, 2016 to Jan. 21, 2017. This figure 
seems to imply that the FB stock price collapsed around Nov. 1, 2016. However, this 
is misleading, because the y axis starts at $110 instead of $0. Data source: Yahoo! 
Finance. 



Figure 17-4. Stock price of Facebook (FB) from Oct. 22, 2016 to Jan. 21, 2017. By showing 
the stock price on a y scale from $0 to $150, this figure more accurately relays the 
magnitude of the FB price drop around Nov. 1, 2016. Data source: Yahoo! Finance. 

The examples of Figures 17-2 and 17-4 could suggest that bars and shaded areas are 
not useful to represent small changes over time or differences between conditions, 
since we always have to draw the whole bar or area starting from 0. However, this is 
not the case. It is perfectly valid to use bars or shaded areas to show differences 
between conditions, as long as we make it explicit which differences we are showing. 
For example, we can use bars to visualize the change in median income in Hawaiian 
counties from 2010 to 2015 (Figure 17-5). For all counties except Kalawao, this 
change amounts to less than $5,000. (Kalawao is an unusual county, with fewer than 
100 inhabitants, and it can experience large swings in median income from a small 
number of people moving into or out of the county.) And for Hawaii County, the 
change is negative; i.e., the median income in 2015 was lower than it was in 2010. We 
represent negative values by drawing bars that go in the opposite direction, extending 
downward from 0 rather than up. 

Figure 17-5. Change in median income in Hawaiian counties from 2010 to 2015. Data 
source: 2010 and 2015 Five-Year American Community Surveys. 

Similarly, we can draw the change in Facebook stock price over time as the difference 
from its temporary high point on Oct. 22, 2016 (Figure 17-6). By shading an area that 
represents the distance from the high point, we are accurately representing the absolute 
magnitude of the price drop without making any implicit statement about the 
magnitude of the price drop relative to the total stock price. 

Figure 17-6. Decline in Facebook (FB) stock price relative to the price of Oct. 22, 2016. 
Between Nov. 1, 2016 and Jan. 1, 2017, the price remained approximately $15 lower 
than it was at its high point on Oct. 22, 2016. The price started to recover in January. 
Data source: Yahoo! Finance. 

Visualizations Along Logarithmic Axes 

When we are visualizing data along a linear scale, the areas of bars, rectangles, or 
other shapes are automatically proportional to the data values. The same is not true if 
we are using a logarithmic scale, because data values are not linearly spaced along the 
axis. Therefore, one could argue that, for example, bar graphs on a log scale are inherently 
flawed. On the flip side, the area of each bar will be proportional to the logarithm 
of the data value, and thus bar graphs on a log scale satisfy the principle of 
proportional ink in log-transformed coordinates. In practice, I think neither of these 
two arguments can resolve whether log-scale bar graphs are appropriate. Instead, the 
relevant question is whether we want to visualize amounts or ratios. 

In Chapter 3, I explained that a log scale is the natural scale to visualize ratios, 
because a unit step along a log scale corresponds to multiplication with or division by 
a constant factor. In practice, however, log scales are often used not specifically to visualize 
ratios but rather just because the numbers shown vary over many orders of 
magnitude. As an example, consider the gross domestic products (GDPs) of countries 
in Oceania. In 2007, these varied from less than a billion US dollars (USD) to over 
300 billion USD (Figure 17-7). Visualizing these numbers on a linear scale would not 
work, because the two countries with the largest GDPs (New Zealand and Australia) 
would dominate the figure. 

Figure 17-7. GDP in 2007 of countries in Oceania. The lengths of the bars do not accurately 
reflect the data values shown, since bars start at the arbitrary value of 0.3 billion 
USD. Data source: Gapminder. 

However, the visualization with bars on a log scale (Figure 17-7) does not work either. 
The bars start at an arbitrary value of 0.3 billion USD, and at a minimum the figure 
suffers from the same problem as Figure 17-1, that the bar lengths are not representative 
of the data values. The added difficulty with a log scale, though, is that we cannot 
simply let the bars start at 0. In Figure 17-7, the value 0 would lie infinitely far to the 
left. Therefore, we could make our bars arbitrarily long by pushing their origin further 
and further away, as in Figure 17-8. This problem always arises when we try to 
visualize amounts (which is what the GDP values are) on a log scale. 

Figure 17-8. GDP in 2007 of countries in Oceania. The lengths of the bars do not accurately 
reflect the data values shown, since bars start at the arbitrary value of 10⁻⁹ billion 
USD. Data source: Gapminder. 
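
To see how strongly the choice of origin affects bars on a log scale, we can compute the drawn bar lengths directly. In the sketch below, the function `log_bar_length` and the two GDP values are hypothetical; the two origins, 0.3 and 10⁻⁹ billion USD, are the ones used in Figures 17-7 and 17-8.

```python
import math


def log_bar_length(value, origin):
    """Length of a bar on a log10 axis that starts at `origin` and ends at `value`."""
    return math.log10(value) - math.log10(origin)


# Hypothetical GDP values in billions of USD (illustrative only).
small, large = 1.0, 100.0

length_ratios = {}
for origin in (0.3, 1e-9):
    ratio = log_bar_length(large, origin) / log_bar_length(small, origin)
    length_ratios[origin] = ratio
    print(f"origin {origin:g}: large bar is {ratio:.2f}x as long as small bar")
```

The values differ by a factor of 100 in both cases, yet the ratio of the drawn lengths shrinks toward 1 as the origin moves further left. No choice of origin makes the lengths reflect the amounts.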

For the data of Figure 17-7, I think bars are inappropriate. Instead, we can simply 
place a dot at the appropriate location along the scale for each country’s GDP and 
avoid the issue of bar lengths altogether (Figure 17-9). Importantly, by placing the 
country names right next to the dots rather than along the y axis, we avoid generating 
the visual perception of a magnitude conveyed by the distance from the country 
name to the dot. 

Figure 17-9. GDP in 2007 of countries in Oceania. Data source: Gapminder. 

If we want to visualize ratios rather than amounts, however, bars on a log scale are a 
perfectly good option. In fact, they are preferable to bars on a linear scale in that case. 
As an example, let’s visualize the GDP values of countries in Oceania relative to the 
GDP of Papua New Guinea. The resulting figure does a good job of highlighting the 
key relationships between the GDPs of the various countries (Figure 17-10). We can 
see that New Zealand has over 8 times the GDP of Papua New Guinea and Australia 
over 64 times, while Tonga and the Federated States of Micronesia each have less than 
1/16th the GDP of Papua New Guinea. French Polynesia and New Caledonia are 
close but have slightly smaller GDPs than Papua New Guinea. 

Figure 17-10. GDP in 2007 of countries in Oceania, relative to the GDP of Papua New 
Guinea. Data source: Gapminder. 

Figure 17-10 also highlights that the natural midpoint of a log scale is 1, with bars 
representing numbers above 1 going in one direction and bars representing numbers 
below 1 going in the other direction. Bars on a log scale represent ratios and should 
always start at 1, and bars on a linear scale represent amounts and should always start 
at 0. 



When bars are drawn on a log scale, they represent ratios and need 
to be drawn starting from 1, not 0. 
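
The following sketch shows why 1 is the natural starting point for ratio bars. It assumes a base-2 log axis, which matches the factors of 8, 64, and 1/16 mentioned above; the function name `ratio_bar` is made up for the example.

```python
import math


def ratio_bar(value, reference):
    """Signed bar extent on a log2 axis for value/reference; a ratio of 1 gives 0."""
    return math.log2(value / reference)


# A value 8 times the reference and a value 1/8 of the reference
# produce bars of equal length pointing in opposite directions.
up = ratio_bar(8.0, 1.0)
down = ratio_bar(1.0, 8.0)
even = ratio_bar(5.0, 5.0)

print(up, down, even)
```

Because equal multiplicative differences map to equal bar lengths, a country with eight times the reference GDP and one with an eighth of it are shown as equally far from the reference.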


Direct Area Visualizations 

All the preceding examples visualized data along one linear dimension, so that each 
data value was encoded both by area and by location along the x or y axis. In these 
cases, we can consider the area encoding as incidental and secondary to the location 
encoding of the data value. Other visualization approaches, however, represent the 
data values primarily or directly by area, without a corresponding location mapping. 
The most common one is the pie chart (Figure 17-11). Even though technically the 
data values are mapped onto angles, which are represented by location along a circular 
axis, in practice we are typically not judging the angles of a pie chart. Instead, the 
dominant visual property we notice is the area of each pie wedge. 

Figure 17-11. Number of inhabitants in Rhode Island counties, shown as a pie chart. 
Both the angle and the area of each pie wedge are proportional to the number of inhabitants 
in the respective county. Data source: 2010 US Decennial Census. 

Because the area of each pie wedge is proportional to its angle, which is proportional 
to the data value the wedge represents, pie charts satisfy the principle of proportional 
ink. However, we perceive the area in a pie chart differently from the same area in a 
bar plot. The fundamental reason is that human perception primarily judges distances 
and not areas. Thus, if a data value is encoded entirely as a distance, as is the case 
with the length of a bar, we perceive it more accurately than when the data value is 
encoded through a combination of two or more distances that jointly create an area. 
To see this difference, compare Figure 17-11 to Figure 17-12, which shows the same 
data as bars. The difference in the number of inhabitants between Providence County 
and the other counties appears larger in Figure 17-12 than in Figure 17-11. 

Figure 17-12. Number of inhabitants in Rhode Island counties, shown as bars. The 
length of each bar is proportional to the number of inhabitants in the respective county. 
Data source: 2010 US Decennial Census. 

The problem that human perception is better at judging distances than at judging 
areas also arises in treemaps (Figure 17-13), which can be thought of as square versions 
of pie charts. Again, in comparison to Figure 17-12, the differences in the number 
of inhabitants among the counties appear less pronounced in Figure 17-13. 




[Treemap labels: Providence 627,000; Kent 166,000; Washington 127,000; Newport 82,900; Bristol 49,900] 


Figure 17-13. Number of inhabitants in Rhode Island counties, shown as a treemap. The 
area of each rectangle is proportional to the number of inhabitants in the respective 
county. Data source: 2010 US Decennial Census. 
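
A rough calculation illustrates why differences look smaller when values are encoded by area. The sketch makes a simplifying assumption that each county is drawn as a square, which is not exactly what a treemap does, and it uses the rounded population labels shown in Figure 17-13. The variable names are chosen for this example only.

```python
import math

# Rounded populations as labeled in Figure 17-13.
providence = 627_000
bristol = 49_900

# Encoded as bar lengths, the visible ratio equals the ratio of the values.
length_ratio = providence / bristol

# Encoded as areas of squares, each side scales with the square root of the value,
# so the ratio of any single visible distance is much smaller.
side_ratio = math.sqrt(providence) / math.sqrt(bristol)

print(f"ratio of bar lengths:  {length_ratio:.1f}")
print(f"ratio of square sides: {side_ratio:.1f}")
```

If perception is driven mainly by distances, as argued above, then the square-root compression of the side lengths is one plausible reason the counties look more alike in the treemap than in the bar plot.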

CHAPTER 18 


Handling Overlapping Points 


When we want to visualize large or very large datasets, we often experience the challenge 
that simple x-y scatterplots do not work very well because many points lie on 
top of each other and partially or fully overlap. And similar problems can arise even 
in small datasets if data values were recorded with low precision or rounded, such 
that multiple observations have exactly the same numeric values. The technical term 
commonly used to describe this situation is overplotting, which means that we are 
plotting many points on top of each other. Here I describe several strategies you can 
pursue when encountering this challenge. 

Partial Transparency and Jittering 

We first consider a scenario with only a moderate number of data points but with 
extensive rounding. Our dataset contains fuel economy during city driving and 
engine displacement for 234 popular car models released between 1999 and 2008 
(Figure 18-1). In this dataset, fuel economy is measured in miles per gallon (mpg) 
and is rounded to the nearest integer value. Engine displacement is measured in liters 
and is rounded to the nearest deciliter. Due to this rounding, many car models have 
exactly identical values. For example, there are 21 cars total with 2.0 liter engine displacement, 
and as a group they have only four different fuel economy values: 19, 20, 
21, or 22 mpg. Therefore, in Figure 18-1 these 21 cars are represented by only four 
distinct points, so that 2.0 liter engines appear much less popular than they actually 
are. Moreover, the dataset contains two four-wheel drive cars with 2.0 liter engines, 
which are represented by black dots. However, these black dots are fully occluded by 
yellow dots, so that it looks like there are no four-wheel drive cars with a 2.0 liter 
engine. 

Figure 18-1. City fuel economy versus engine displacement, for popular cars released 
between 1999 and 2008. Each point represents one car. The point color encodes the drive 
train: front-wheel drive (FWD), rear-wheel drive (RWD), or four-wheel drive (4WD). 
The figure is labeled “bad” because many points are plotted on top of others and obscure 
them. Data source: US Environmental Protection Agency (EPA), 
https://fueleconomy.gov. 

One way to ameliorate this problem is to use partial transparency. If we make individual 
points partially transparent, then overplotted points appear as darker points 
and thus the shade of the points reflects the density of points in that location of the 
graph (Figure 18-2). 
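
How quickly stacked points darken can be estimated with a short calculation. The sketch assumes that the plotting software uses standard alpha compositing and that all stacked points have the same color, in which case n points of opacity alpha combine to an opacity of 1 - (1 - alpha)^n. The function name and the opacity of 0.3 are chosen for illustration.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n identical points of opacity `alpha` drawn on top of each other."""
    return 1 - (1 - alpha) ** n


for n in (1, 2, 5, 21):
    print(n, round(stacked_opacity(0.3, n), 3))
```

The combined opacity rises steeply for the first few points and then levels off just below 1, so a stack of 5 points and a stack of 21 points look nearly the same.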

However, making points partially transparent is not always sufficient to solve the 
issue of overplotting. For example, even though we can see in Figure 18-2 that some 
points have a darker shade than others, it is difficult to estimate how many points 
were plotted on top of each other in each location. In addition, while the differences 
in shading are clearly visible, they are not self-explanatory. A reader who sees this figure 
for the first time will likely wonder why some points are darker than others and 
will not realize that those points are in fact multiple points stacked on top of each 
other. A simple trick that helps in this situation is to apply a small amount of jitter to 
the points—i.e., to displace each point randomly by a small amount in either the x or 
the y direction or both. With jitter, it is immediately apparent that the darker areas 
arise from points that are plotted on top of each other (Figure 18-3). Also, now, for 
the first time the black dots that represent four-wheel drive cars with 2.0 liter engines 
can be seen. 

Figure 18-2. City fuel economy versus engine displacement. Because points have been 
made partially transparent, points that lie on top of other points can now be identified 
by their darker shade. Data source: EPA. 




Figure 18-3. City fuel economy versus engine displacement. By adding a small amount of 
jitter to each point, we can increase the visibility of the overplotted points without substantially 
distorting the message of the plot. Data source: EPA. 

One downside of jittering is that it does change the data and therefore has to be performed 
with care. If we jitter too much, we end up placing points in locations that are 
not representative of the underlying dataset. The result is a misleading visualization 
of the data. See Figure 18-4 as an example. 

Figure 18-4. City fuel economy versus engine displacement. By adding too much jitter to 
the points, we have created a visualization that does not accurately reflect the underlying 
dataset. Data source: EPA. 
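
Jittering itself takes only a few lines. The sketch below displaces each point by a uniform random amount within a chosen half-width in each direction. The function name `jitter` and the specific half-widths are assumptions made for this example; the rounding steps of 0.1 liters and 1 mpg are those described for the dataset above.

```python
import random


def jitter(points, width, height, seed=0):
    """Return a copy of `points` with each (x, y) moved by at most width/height."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-width, width), y + rng.uniform(-height, height))
        for x, y in points
    ]


# Five hypothetical cars that share the same rounded values (2.0 liters, 20 mpg).
cars = [(2.0, 20)] * 5

# Half-widths well below half the rounding step (0.05 liters, 0.5 mpg) keep
# every jittered point closer to its own grid value than to a neighboring one.
spread = jitter(cars, width=0.03, height=0.3)
print(spread)
```

Tying the jitter amount to the rounding step of the data is one simple way to guard against the overly strong jitter shown in Figure 18-4.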

2D Histograms 

When the number of individual points gets very large, partial transparency (with or 
without jittering) will not be sufficient to resolve the overplotting issue. What will 
typically happen is that areas with high point density will appear as uniform blobs of 
dark color, while in areas with low point density the individual points are barely visible 
(Figure 18-5). Changing the transparency level of individual points will 
ameliorate one of these problems while worsening the other; no transparency 
setting can address both at the same time. 

Figure 18-5. Departure delay in minutes versus flight departure time, for all flights 
departing Newark Airport (EWR) in 2013. Each dot represents one departure. Data 
source: US Dept. of Transportation, Bureau of Transportation Statistics. 

Figure 18-5 shows departure delays for over 100,000 individual flights, with each dot 
representing one flight departure. Even though we have made the individual dots 
fairly transparent, the majority of them just form a black band between 0 and 300 
minutes departure delay. This band obscures whether most flights depart approxi¬ 
mately on time or with a substantial delay (say, 50 minutes or more). At the same 
time, the most-delayed flights (with delays of 400 minutes or more) are barely visible 
due to the transparency of the dots. 

In such cases, instead of plotting individual points, we can make a 2D histogram. A 
2D histogram is conceptually similar to a 1D histogram, as discussed in Chapter 7, 
but now we bin the data in two dimensions. We subdivide the entire x-y plane into 
small rectangles, count how many observations fall into each one, and then color the 
rectangles by those counts. Figure 18-6 shows the result of this approach for the 
departure delay data. This visualization highlights several important features of the 
flight departure data. First, the vast majority of departures during the day (from 6 
a.m. to about 9 p.m.) actually depart without delay or even early (negative delay). 
However, a modest number of departures have a substantial delay. Moreover, the later 
a plane departs in the day, the more of a delay it can have. Importantly, the departure 
time is the actual time of departure, not the scheduled time of departure, so this figure 
does not necessarily tell us that planes scheduled to depart early never experience 
delay. What it does tell us, though, is that if a plane departs early it either has little 
delay or, in very rare cases, a delay of around 900 minutes. 

Figure 18-6. Departure delay in minutes versus the flight departure time. Each colored 
rectangle represents all flights departing at that time with that departure delay. Coloring 
represents the number of flights represented by that rectangle. Data source: US Dept. of 
Transportation, Bureau of Transportation Statistics. 
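
The binning step behind a 2D histogram can be sketched as follows. The function name `bin_2d`, the bin sizes, and the handful of flights are hypothetical and serve only to show the mechanics; they are not taken from the flight dataset.

```python
import math
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count how many (x, y) points fall into each rectangular bin."""
    counts = Counter()
    for x, y in points:
        # Integer bin indices; floor also handles negative values such as early departures.
        key = (math.floor(x / x_width), math.floor(y / y_width))
        counts[key] += 1
    return counts


# Hypothetical flights: (departure time in hours, departure delay in minutes).
flights = [(6.2, -3), (6.7, 4), (6.9, 12), (18.5, 95), (18.1, 130)]

counts = bin_2d(flights, x_width=1.0, y_width=20.0)
for key, n in sorted(counts.items()):
    print(key, n)
```

Each key identifies one rectangle, and the count stored for it is the quantity that would be mapped onto color in a figure like Figure 18-6.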

As an alternative to binning the data into rectangles, we can bin into hexagons [Carr 
et al. 1987]. This approach has the advantage that the points in a hexagon are, on 
average, closer to the hexagon’s center than the points in an equal-area square are to 
the center of the square. Therefore, the colored hexagons represent the data slightly 
more accurately than the colored rectangles. Figure 18-7 shows the flight departure 
data with hexagon binning rather than rectangular binning. 

Figure 18-7. Departure delay in minutes versus the flight departure time. Each colored 
hexagon represents all flights departing at that time with that departure delay. Coloring 
represents the number of flights represented by that hexagon. Data source: US Dept. of 
Transportation, Bureau of Transportation Statistics. 
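
The claim that points in a hexagon lie closer to the center than points in an equal-area square can be checked numerically. The sketch below is a simple Monte Carlo estimate for a regular hexagon and a square that both have area 1, assuming points are spread uniformly within each shape. All names in it are made up for this example.

```python
import math
import random


def mean_center_distance(inside, half_w, half_h, n=100_000, seed=0):
    """Average distance to the origin of n uniform points accepted by `inside`."""
    rng = random.Random(seed)
    total, kept = 0.0, 0
    while kept < n:
        # Draw from the bounding box and keep only points inside the shape.
        x = rng.uniform(-half_w, half_w)
        y = rng.uniform(-half_h, half_h)
        if inside(x, y):
            total += math.hypot(x, y)
            kept += 1
    return total / n


# Side length of a regular hexagon with area 1 (area = 3*sqrt(3)/2 * side**2).
SIDE = math.sqrt(2 / (3 * math.sqrt(3)))
ROOT3 = math.sqrt(3)


def in_hexagon(x, y):
    """Regular hexagon centered at the origin with two vertices on the x axis."""
    return abs(y) <= SIDE * ROOT3 / 2 and ROOT3 * abs(x) + abs(y) <= ROOT3 * SIDE


def in_square(x, y):
    """The bounding box is the unit square itself, so every point is accepted."""
    return True


square_mean = mean_center_distance(in_square, 0.5, 0.5)
hexagon_mean = mean_center_distance(in_hexagon, SIDE, SIDE * ROOT3 / 2)
print(f"square:  {square_mean:.4f}")
print(f"hexagon: {hexagon_mean:.4f}")
```

The hexagon should come out slightly ahead, which is consistent with the statement above that the gain in accuracy is real but modest.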


Contour Lines 

Instead of binning data points into rectangles or hexagons, we can also estimate the 
point density across the plot area and indicate regions of different point densities 
with contour lines. This technique works well when the point density changes slowly 
across both the x and the y dimensions. 
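
A common way to estimate point density is kernel density estimation, and contour lines then connect locations of equal estimated density. The sketch below uses a product of Gaussian kernels with fixed bandwidths. The text does not say which estimator was used for the figures, so this is an assumed technique, and the function name, bandwidths, and data points are all hypothetical.

```python
import math


def kde_2d(points, x, y, bw_x, bw_y):
    """Gaussian kernel density estimate at location (x, y)."""
    total = 0.0
    for px, py in points:
        zx = (x - px) / bw_x
        zy = (y - py) / bw_y
        total += math.exp(-0.5 * (zx * zx + zy * zy))
    # Normalize so that the estimated density integrates to 1.
    return total / (2 * math.pi * bw_x * bw_y * len(points))


# Hypothetical measurements: (body mass in g, head length in mm).
birds = [(72, 55.5), (74, 56.0), (75, 56.5), (76, 56.2), (78, 57.0), (68, 54.0)]

center = kde_2d(birds, 75, 56.3, bw_x=2.0, bw_y=0.5)
edge = kde_2d(birds, 66, 53.0, bw_x=2.0, bw_y=0.5)
print(center, edge)
```

Evaluating such a function on a regular grid and tracing lines of constant value yields the contours. Larger bandwidths give smoother contours, at the cost of blurring local structure.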

As an example of this approach, we return to the blue jays dataset from Chapter 12. 
Figure 12-1 showed the relationship between head length and body mass for 123 blue 
jays, and there was some amount of overlap among the points. We can highlight the 
distribution of points more clearly by making the points smaller and partially transparent 
and plotting them on top of contour lines that delineate regions of similar 
point density (Figure 18-8). We can further enhance the perception of changes in the 
point density by shading the regions enclosed by the contour lines, using darker colors 
for regions representing higher point densities (Figure 18-9). 

Figure 18-8. Head length versus body mass for 123 blue jays, as in Figure 12-1. Each dot 
corresponds to one bird, and the lines indicate regions of similar point density. The point 
density increases toward the center of the plot, near a body mass of 75 g and a head 
length between 55 mm and 57.5 mm. Data source: Keith Tarvin, Oberlin College. 

Figure 18-9. Head length versus body mass for 123 blue jays. This figure is nearly identical 
to Figure 18-8, but now the areas enclosed by the contour lines are shaded with 
increasingly darker shades of gray. This shading creates a stronger visual impression of 
increasing point density toward the center of the point cloud. Data source: Keith Tarvin, 
Oberlin College. 

In Chapter 12, we also looked at the relationship between head length and body mass 
separately for male and female birds (Figure 12-2). We can do the same with contour 
lines, by drawing separately colored contour lines for male and female birds 
(Figure 18-10). 



Figure 18-10. Head length versus body mass for 123 blue jays. As in Figure 12-2, we can 
also indicate the birds’ sex by color when drawing contour lines. This figure highlights 
how the point distribution is different for male and female birds. In particular, male 
birds are more densely clustered in one region of the plot area whereas female birds are 
more spread out. Data source: Keith Tarvin, Oberlin College. 


Drawing multiple sets of contour lines in different colors can be a powerful strategy 
for showing the distributions of several point clouds at once. However, this technique 
needs to be employed with care. It only works when the number of groups with distinct 
colors is small (two to three) and the groups are clearly separated. Otherwise, we 
may end up with a hairball of differently colored lines all crisscrossing each other and 
not showing any particular pattern at all. 

To illustrate this potential problem, I will employ the diamonds dataset, which contains 
information for 53,940 diamonds, including their price, weight (carat), and cut. 
Figure 18-11 shows this dataset as a scatterplot. The figure exhibits severe overplotting. 
There are so many different-colored points on top of one another that it is 
impossible to discern anything beyond the overall broad outline of where diamonds 
fall on the price-carat spectrum. 

Figure 18-11. Price of diamonds versus their carat value, for 53,940 individual diamonds. 
Each diamond’s cut is indicated by color. The plot is labeled as “bad” because the 
extensive overplotting makes it impossible to discern any patterns among the different 
diamond cuts. Data source: Hadley Wickham, ggplot2. 

We could try to draw colored contour lines for the different qualities of cut, as in 
Figure 18-10. However, in the diamonds dataset, we have five distinct colors and the 
groups strongly overlap. Therefore, the contour plot (Figure 18-12) is not much better 
than the original scatterplot (Figure 18-11). 

Figure 18-12. Price of diamonds versus their carat value. As in Figure 18-11, but now 
individual points have been replaced by contour lines. The resulting plot is still labeled 
“bad,” because the contour lines all lie on top of each other. Neither the point distribution 
for individual cuts nor the overall point distribution can be discerned. Data source: Hadley 
Wickham, ggplot2. 

What helps here is to draw the contour lines for each cut quality in its own plot panel 
(Figure 18-13). The purpose of drawing them all in one panel might be to enable visual 
comparison between the groups, but Figure 18-12 is so busy that a comparison 
isn’t possible. Instead, in Figure 18-13, the background grid enables us to make comparisons 
across cut qualities by paying attention to where exactly the contour lines 
fall relative to the grid lines. (A similar effect could have been achieved by plotting 
partially transparent individual points instead of contour lines in each panel.) 

Figure 18-13. Price of diamonds versus their carat value. Here, we have taken the density 
contours from Figure 18-12 and drawn them separately for each cut. We can now see 
that better cuts (very good, premium, ideal) tend to have lower carat values than the 
poorer cuts (fair, good) but command a higher price per carat. Data source: Hadley 
Wickham, ggplot2. 


We can now make out two main trends. First, the better cuts (very good, premium, 
ideal) tend to have lower carat values than the poorer cuts (fair, good). Recall that 
carat is a measure of diamond weight (1 carat = 0.2 grams). Better cuts tend to result 
(on average) in lighter diamonds because more material needs to be removed to create 
them. Second, at the same carat value, better cuts tend to command higher prices. 
To see this pattern, look for example at the price distribution for 0.5 carats. The distribution 
is shifted upwards for better cuts, and in particular it is substantially higher for 
diamonds with ideal cut than for diamonds with fair or good cut. 
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CHAPTER 19 


Common Pitfalls of Color Use 


Color can be an incredibly effective tool to enhance data visualizations. At the same 
time, poor color choices can ruin an otherwise excellent visualization. Color needs to 
be applied to serve a purpose, it must be clear, and it must not distract. 

Encoding Too Much or Irrelevant Information 

One common mistake is trying to give color a job that is too big for it to handle, by 
encoding too many different items in different colors. As an example, consider 
Figure 19-1. It shows population growth versus population size for all 50 US states 
and the District of Columbia. I have attempted to identify each state by giving it its 
own color. However, the result is not very useful. Even though we can guess which 
state is which by looking at the colored points in the plot and in the legend, it takes a 
lot of effort to go back and forth between the two to try to match them up. There are 
simply too many different colors, and many of them are quite similar to each other. 
Even if with a lot of effort we can figure out exactly which state is which, this
visualization defeats the purpose of coloring. We should use color to enhance figures and
make them easier to read, not to obscure the data by creating visual puzzles.

Figure 19-1. Population growth from 2000 to 2010 versus population size in 2000, for all
50 US states and the District of Columbia. Every state is marked in a different color. 
Because there are so many states, it is very difficult to match the colors in the legend to 
the dots in the scatterplot. Data source: US Census Bureau. 

As a rule of thumb, qualitative color scales work best when there are three to five
different categories that need to be colored. Once we reach 8 to 10 different categories or
more, the task of matching colors to categories becomes too burdensome to be useful,
even if the colors remain sufficiently different to be distinguishable in principle. For
the dataset of Figure 19-1, it is probably best to use color only to indicate the
geographic region of each state and to identify individual states by direct labeling, i.e.,
by placing appropriate text labels adjacent to the data points (Figure 19-2). Even
though we cannot label every individual state without making the figure too crowded,
direct labeling is the right choice for this figure. In general, for figures such as this 
one, we don’t need to label every single data point. It is sufficient to label a
representative subset, for example a set of states we specifically want to call out in the text that
will accompany the figure. We always have the option to also provide the underlying
data as a table if we want to make sure the reader has access to it in its entirety.

Figure 19-2. Population growth from 2000 to 2010 versus population size in 2000. In
contrast to Figure 19-1, I have now colored states by region and have directly labeled a
subset of states. The majority of states have been left unlabeled to prevent overcrowding
in the figure. Data source: US Census Bureau.



Use direct labeling instead of colors when you need to distinguish 
between more than about eight categorical items. 


A second common problem is coloring for the sake of coloring, without having a 
clear purpose for the colors. As an example, consider Figure 19-3, which is a variation 
of Figure 4-2. However, now instead of coloring the bars by geographic regions, I 
have given each bar its own color, so that in aggregate the bars create a rainbow effect. 
This may look like an interesting visual effect, but it is not creating any new insight
into the data or making the figure easier to read.

Figure 19-3. Population growth in the US from 2000 to 2010. The rainbow coloring of
states serves no purpose and is distracting. Furthermore, the colors are overly saturated. 
Data source: US Census Bureau. 


Besides the gratuitous use of different colors, Figure 19-3 has a second color-related 
problem: the chosen colors are too saturated and intense. This color intensity makes 
the figure difficult to look at. For example, it is difficult to read the names of the states 
without having our eyes drawn to the large, strongly colored areas right next to the 
state names. Similarly, it is difficult to compare the endpoints of the bars to the 
underlying grid lines.

Avoid large filled areas of overly saturated colors. They make it difficult
for your reader to carefully inspect your figure.


Using Nonmonotonic Color Scales to Encode Data Values 

In Chapter 4, I listed two critical conditions for designing sequential color scales that
can represent data values: the colors need to clearly indicate which data values are
larger or smaller than which other ones, and the differences between colors need to
visualize the corresponding differences between data values. Unfortunately, several
existing color scales, including very popular ones, violate one or both of these
conditions. The most popular such scale is the rainbow scale (Figure 19-4). It runs
through all possible colors in the color spectrum. This means the scale is effectively 
circular; the colors at the beginning and the end are nearly the same (dark red). If 
these two colors end up next to each other in a plot, we do not instinctively perceive 
them as representing data values that are maximally apart. In addition, the scale is 
highly nonmonotonic. It has regions where colors change very slowly and others 
where colors change rapidly. This lack of monotonicity becomes particularly apparent 
if we look at the color scale converted to grayscale (Figure 19-4). The scale goes from 
medium dark to light to very dark and back to medium dark, and there are large 
stretches where lightness changes very little followed by relatively narrow stretches 
with large changes in lightness. 


[Figure 19-4 panels: rainbow scale; rainbow converted to grayscale]

Figure 19-4. The rainbow color scale is highly nonmonotonic. This becomes apparent 
when the colors are converted to gray values. From left to right, the scale goes from
moderately dark to light to very dark and back to moderately dark. In addition, the changes
in lightness are nonuniform. The lightest part of the scale (corresponding to the colors
yellow, light green, and cyan) takes up almost a third of the entire scale, while the
darkest part (corresponding to dark blue) is concentrated in a narrow region of the scale.
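
The grayscale check described above can be approximated in a few lines of code. The following is only a sketch, not the procedure used to produce Figure 19-4: it assumes a plain HSV hue sweep as a stand-in for the rainbow scale, and it uses the Rec. 601 luma weights as a rough proxy for perceived lightness. All function names are hypothetical.

```python
import colorsys

def rainbow(n):
    """Return n RGB triples (floats in 0-1) sweeping hue once around the color wheel.

    Assumption: a full-saturation HSV sweep stands in for the rainbow scale.
    """
    return [colorsys.hsv_to_rgb(i / (n - 1), 1.0, 1.0) for i in range(n)]

def gray_value(rgb):
    """Approximate lightness with Rec. 601 luma weights (an assumption, not a color model)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def is_monotonic(values):
    """True if the values never decrease or never increase."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

rainbow_gray = [gray_value(c) for c in rainbow(13)]
# A plain dark-to-light gray ramp for comparison.
ramp_gray = [gray_value((t / 12, t / 12, t / 12)) for t in range(13)]

for g in rainbow_gray:
    print(f"{g:.2f}")
print("rainbow monotonic:", is_monotonic(rainbow_gray))
print("gray ramp monotonic:", is_monotonic(ramp_gray))
print("first and last rainbow colors equal:", rainbow(13)[0] == rainbow(13)[-1])
```

The printed gray values rise and fall several times along the sweep, and the first and last colors coincide, which mirrors the two problems named in the text: the scale is circular and its lightness is not monotonic.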

In a visualization of actual data, the rainbow scale tends to obscure data features 
and/or highlight arbitrary aspects of the data (Figure 19-5). As an aside, the colors in 
the rainbow scale are also overly saturated. Looking at Figure 19-5 for any extended
period of time can be quite uncomfortable.

Figure 19-5. Percentage of people identifying as white in Texas counties. The rainbow
color scale is not an appropriate scale to visualize continuous data values, because it 
tends to place emphasis on arbitrary features of the data. Here, it emphasizes counties in 
which approximately 75% of the population identify as white. Data source: 2010 US 
Decennial Census. 


Not Designing for Color-Vision Deficiency 

Whenever we are choosing colors for a visualization, we need to keep in mind that a 
good proportion of our readers may have some form of color-vision deficiency (i.e., 
are colorblind). These readers may not be able to distinguish colors that look clearly 
different to most other people. People with impaired color vision are not literally 
unable to see any colors, however. Instead, they will typically have difficulty
distinguishing certain types of colors, such as red and green (red-green color-vision
deficiency) or blue and green (blue-yellow color-vision deficiency). The technical terms
for these deficiencies are deuteranomaly/deuteranopia and protanomaly/protanopia
for the red-green variant (where people have difficulty perceiving either green or red,
respectively) and tritanomaly/tritanopia for the blue-yellow variant (where people
have difficulty perceiving blue). The terms ending in “anomaly” refer to some
impairment in the perception of the respective color, and the terms ending in “anopia” refer
to complete absence of perception of that color. Approximately 8% of males and 0.5%
of females suffer from some sort of color-vision deficiency (CVD); deuteranomaly is
the most common form whereas tritanomaly is relatively rare. 

As discussed in Chapter 4, there are three fundamental types of color scales used in 
data visualization: sequential scales, diverging scales, and qualitative scales. Of these 
three, sequential scales will generally not cause any problems for people with CVD, 
since a properly designed sequential scale should present a continuous gradient from 
dark to light colors. Figure 19-6 shows the Heat scale from Figure 4-3 in simulated 
versions of deuteranomaly, protanomaly, and tritanomaly. While none of these
CVD-simulated scales look like the original, they all present a clear gradient from dark to
light and they all work well to convey the magnitude of a data value. 


[Figure 19-6 panels: original, deuteranomaly, protanomaly, tritanomaly]



Figure 19-6. Color-vision deficiency simulation of the sequential color scale Heat, which 
runs from dark red to light yellow. From left to right and top to bottom, we see the
original scale and the scale as seen under deuteranomaly, protanomaly, and tritanomaly
simulations. Even though the specific colors look different under the three types of CVD, in
each case we can see a clear gradient from dark to light. Therefore, this color scale is safe 
to use for viewers with CVD. 

Things become more complicated for diverging scales, because popular color
contrasts can be indistinguishable to people with CVD. In particular, the colors red and
green provide about the strongest contrast for people with normal color vision but
become nearly indistinguishable for deutans (people with deuteranomaly) or protans
(people with protanomaly) (Figure 19-7). Similarly, blue-green contrasts are visible
for deutans and protans but become indistinguishable for tritans (people with
tritanomaly) (Figure 19-8).

These examples might suggest that it is nearly impossible to find two contrasting
colors that are safe under all forms of CVD. However, the situation is not that dire. It is
often possible to make slight modifications to the colors such that they have the
desired character while also being safe for viewers with CVD. For example, the
ColorBrewer PiYG (pink to yellow-green) scale from Figure 4-5 looks red-green to people
with normal color vision yet remains distinguishable for people with CVD
(Figure 19-9).

Figure 19-7. A red-green contrast becomes indistinguishable under red-green CVD
(deuteranomaly or protanomaly).

Figure 19-8. A blue-green contrast becomes indistinguishable under blue-yellow CVD
(tritanomaly).

Figure 19-9. The ColorBrewer PiYG (pink to yellow-green) scale from Figure 4-5 looks
like a red-green contrast to people with regular color vision but works for people with all 
forms of color-vision deficiency. It works because the reddish color is actually pink (a 
mix of red and blue) while the greenish color also contains yellow. The difference in the 
blue component between the two colors can be picked up by deutans or protans, and the 
difference in the red component can be picked up by tritans. 

Things are most complicated for qualitative scales, because there we need many
different colors and they all need to be distinguishable from each other under all forms
of CVD. My preferred qualitative color scale, which I use extensively throughout this
book, was developed specifically to address this challenge (Figure 19-10). By
providing eight different colors, the palette works for nearly any scenario with discrete
colors. As discussed at the beginning of this chapter, you should probably not color-code
more than eight different items in a plot anyway.

Figure 19-10. Qualitative color palette for all color-vision deficiencies [Okabe and Ito
2008]. The alphanumeric codes represent the colors in RGB space, encoded as
hexadecimals. In many plot libraries and image manipulation programs, you can just enter these
codes directly. If your software does not take hexadecimals directly, you can also use the 
values in Table 19-1. 


Table 19-1. Colorblind-friendly color scale [Okabe and Ito 2008].

Name            Hex code  Hue   C, M, Y, K (%)  R, G, B (0-255)  R, G, B (%)
Orange          #E69F00   41°   0,50,100,0      230,159,0        90,60,0
Sky blue        #56B4E9   202°  80,0,0,0        86,180,233       35,70,90
Bluish green    #009E73   164°  97,0,75,0       0,158,115        0,60,50
Yellow          #F0E442   56°   10,5,90,0       240,228,66       95,90,25
Blue            #0072B2   202°  100,50,0,0      0,114,178        0,45,70
Vermilion       #D55E00   27°   0,80,100,0      213,94,0         80,40,0
Reddish purple  #CC79A7   326°  10,70,0,0       204,121,167      80,60,70
Black           #000000   N/A   0,0,0,100       0,0,0            0,0,0
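
If your software accepts neither hexadecimal codes nor a lookup table, the conversion is simple to do yourself. The following sketch turns the hex codes of Table 19-1 into R, G, B values on the 0-255 scale and into percentages. The names `OKABE_ITO`, `hex_to_rgb`, and `rgb_to_percent` are hypothetical helpers introduced here for illustration.

```python
# Hex codes copied from Table 19-1.
OKABE_ITO = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermilion": "#D55E00",
    "reddish purple": "#CC79A7",
    "black": "#000000",
}

def hex_to_rgb(code):
    """Convert a '#RRGGBB' string to an (R, G, B) tuple of ints in 0-255."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_percent(rgb):
    """Convert 0-255 channel values to whole-number percentages."""
    return tuple(round(100 * c / 255) for c in rgb)

for name, code in OKABE_ITO.items():
    rgb = hex_to_rgb(code)
    print(f"{name:15s} {code}  {rgb}  {rgb_to_percent(rgb)}")
```

The 0-255 values reproduce the corresponding column of Table 19-1. The percentages in the table appear to be rounded more coarsely than to the nearest percent, so the values computed here can differ from them by a few points.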


While there are several good, CVD-safe color scales readily available, we need to
recognize that they are not magic bullets. It is very possible to use a CVD-safe scale and
yet produce a figure a person with CVD cannot decipher. One critical parameter is
the size of the colored graphical elements. Colors are much easier to distinguish when
they are applied to large areas than to small ones or thin lines [Stone, Albers Szafir,
and Setlur 2014], and this effect is exacerbated under CVD (Figure 19-11). In
addition to the various color design considerations discussed in this chapter and in
Chapter 4, I recommend viewing color figures under CVD simulations to get a sense of
what they may look like for a person with CVD. There are several online services and
desktop apps available that allow you to run arbitrary figures through a CVD
simulation.

Figure 19-11. Colored elements become difficult to distinguish at small sizes. The top-left
panel (labeled “original”) shows four rectangles, four thick lines, four thin lines, and four 
groups of points, all colored in the same four colors. We can see that the colors become 
more difficult to distinguish the smaller or thinner the visual elements are. This problem 
becomes exacerbated in the CVD simulations, where the colors are already more difficult 
to distinguish even for the large graphical elements. 



To make sure your figures work for people with CVD, don’t just 
rely on specific color scales. Instead, test your figures in a CVD 
simulator.

CHAPTER 20


Redundant Coding 


In Chapter 19, we saw that color cannot always convey information as effectively as 
we might wish. If we have many different items we want to identify, doing so by color 
may not work. It will be difficult to match the colors in the plot to the colors in the 
legend (Figure 19-1). And even if we only need to distinguish two or three different 
items, color may fail if the colored items are very small (Figure 19-11) and/or the
colors look similar for people suffering from color-vision deficiency (Figures 19-7 and
19-8). The general solution in all these scenarios is to use color to enhance the visual 
appearance of the figure without relying entirely on color to convey key information. 
I refer to this design principle as redundant coding, because it prompts us to encode 
data redundantly, using multiple different aesthetic dimensions. 

Designing Legends with Redundant Coding 

Scatterplots of several groups of data are frequently designed such that the points
representing different groups differ only in their color. As an example, consider
Figure 20-1, which shows the sepal width versus the sepal length of three different Iris
species. (Sepals are the outer leaves of flowers in flowering plants.) The points
representing the different species differ in their colors, but otherwise all points look exactly
the same. Even though this figure contains only three distinct groups of points, it is 
difficult to read even for people with normal color vision. The problem arises because 
the data points for the two species Iris virginica and Iris versicolor intermingle, and 
their two respective colors, green and blue, are not particularly distinct from each 
other.

Figure 20-1. Sepal width versus sepal length for three different Iris species (Iris setosa,
Iris virginica, and Iris versicolor). Each point represents the measurements for one 
plant sample. A small amount of jitter has been applied to all point positions to prevent 
overplotting. The figure is labeled “bad” because the virginica points in green and the 
versicolor points in blue are difficult to distinguish from each other. Data source: [Fisher 
1936]. 


Surprisingly, the green and blue points look more distinct for people with red-green 
color-vision deficiency (deuteranomaly or protanomaly) than for people with normal 
color vision (compare Figure 20-2, top row, to Figure 20-1). On the other hand, for 
people with blue-yellow deficiency (tritanomaly), the blue and green points look very 
similar (Figure 20-2, bottom left). And if we print out the figure in grayscale (i.e., we 
desaturate the figure), we cannot distinguish any of the Iris species (Figure 20-2,
bottom right).

Figure 20-2. Color-vision deficiency simulation of Figure 20-1. Data source: [Fisher
1936].
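
Whether a set of colors survives grayscale printing can be estimated before making the figure. The sketch below converts colors to approximate gray values and reports pairs that end up too close together. It rests on several assumptions: the Rec. 601 luma weights as a stand-in for desaturation, an arbitrary threshold of 25 gray levels, and three colors taken from Table 19-1 purely for illustration (they are not necessarily the colors used in Figure 20-1).

```python
from itertools import combinations

def gray_value(rgb):
    """Approximate gray level (0-255) using Rec. 601 luma weights (an assumption)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def confusable_in_gray(colors, min_gap=25):
    """Return pairs of color names whose gray values differ by less than min_gap.

    min_gap is a hypothetical threshold on the 0-255 scale, chosen for illustration.
    """
    grays = {name: gray_value(rgb) for name, rgb in colors.items()}
    return [
        (a, b)
        for a, b in combinations(grays, 2)
        if abs(grays[a] - grays[b]) < min_gap
    ]

# R, G, B values (0-255) from Table 19-1, picked only as an example.
colors = {
    "orange": (230, 159, 0),
    "sky blue": (86, 180, 233),
    "blue": (0, 114, 178),
}

print(confusable_in_gray(colors))  # [('orange', 'sky blue')]
```

Orange and sky blue look clearly different in color but map to nearly the same gray value, so a figure relying on that pair alone would not survive desaturation.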


There are two simple improvements we can make to Figure 20-1 to alleviate these 
issues. First, we can swap the colors used for Iris setosa and Iris versicolor, so that the 
blue is no longer directly next to the green (Figure 20-3). Second, we can use three 
different symbol shapes, so that the points all look different. With these two changes, 
both the original version of the figure (Figure 20-3) and the versions under
color-vision deficiency and in grayscale (Figure 20-4) become legible.

Figure 20-3. Sepal width versus sepal length for three different Iris species. Compared to
Figure 20-1, we have swapped the colors for Iris setosa and Iris versicolor and we have
given each Iris species its own point shape. Data source: [Fisher 1936].

Figure 20-4. Color-vision deficiency simulation of Figure 20-3. Because of the use of
different point shapes, even the fully desaturated grayscale version of the figure is legible.
Data source: [Fisher 1936]. 


Changing the point shape is a simple strategy for scatterplots, but it doesn’t
necessarily work for other types of plots. In line plots, we could change the line type (solid,
dashed, dotted, etc.; see also Figure 2-1), but using dashed or dotted lines often yields 
sub-optimal results. In particular, dashed or dotted lines usually don’t look good 
unless they are perfectly straight or only gently curved, and in either case they create 
visual noise. Also, it frequently requires significant mental effort to match different 
types of dash or dot-dash patterns from the plot to the legend. So what do we do with 
a visualization such as Figure 20-5, which uses lines to show the change in stock price 
over time for four different major tech companies?

Figure 20-5. Stock price over time for four major tech companies. The stock price for
each company has been normalized to equal 100 in June 2012. This figure is labeled as 
“bad” because it takes considerable mental energy to match the company names in the 
legend to the data curves. Data source: Yahoo! Finance. 

The figure contains four lines representing the stock prices of the four different
companies. The lines are color-coded using a colorblind-friendly color scale. Thus, it
should be relatively straightforward to associate each line with the corresponding
company, yet it is not. The problem here is that the data lines have a visual order.
The yellow line, representing Facebook, is perceived as the highest line, and the black
line, representing Apple, is perceived as the lowest, with Alphabet and Microsoft in
between, in that order. Yet the order of the four companies in the legend is Alphabet,
Apple, Facebook, Microsoft (alphabetical order). Thus, the perceived order of the
data lines differs from the order of the companies in the legend, and it takes a
surprising amount of mental effort to match data lines with company names.

This problem arises commonly with plotting software that autogenerates legends. 
The plotting software has no concept of the visual order the viewer will perceive. 
Instead, the software sorts the legend by some other order, most commonly
alphabetical. We can fix this problem by manually reordering the entries in the legend so they
match the perceived ordering in the data (Figure 20-6). The result is a figure that
makes it much easier to match the legend to the data.

Figure 20-6. Stock price over time for four major tech companies. Compared to
Figure 20-5, the entries in the legend have now been ordered such that they match the 
perceived visual order of the data lines, with Facebook the highest and Apple the lowest. 
Data source: Yahoo! Finance. 
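
The reordering does not have to be done by hand. One simple heuristic, sketched below, is to sort the legend entries by the final value of each line, since the right-hand end of the lines is where the visual order is usually most apparent. This is an assumption about what readers perceive, not a rule, and the price series are made up for illustration; they are not the data behind Figure 20-5.

```python
def legend_order(series):
    """Order series names by their final value, highest first."""
    return sorted(series, key=lambda name: series[name][-1], reverse=True)

# Hypothetical normalized prices (100 at the start of the period).
prices = {
    "Alphabet": [100, 140, 210, 290],
    "Apple": [100, 90, 130, 160],
    "Facebook": [100, 180, 300, 420],
    "Microsoft": [100, 120, 170, 230],
}

print(legend_order(prices))  # ['Facebook', 'Alphabet', 'Microsoft', 'Apple']
```

The resulting list can then be handed to whatever mechanism the plotting software offers for setting the legend order. If the lines cross near the right edge, a different summary (for example, the mean over the last few points) may match the perceived order better.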



If there is a visual ordering in your data, make sure to match it in 
the legend. 


Matching the legend order to the data order is always helpful, but the benefits are 
particularly obvious under color-vision deficiency simulation (Figure 20-7). For 
example, it helps in the tritanomaly version of the figure, where the blue and the 
green become difficult to distinguish (Figure 20-7, bottom left). It also helps in the 
grayscale version (Figure 20-7, bottom right). Even though the two colors for
Facebook and Alphabet have virtually the same gray value, we can see that Microsoft and
Apple are represented by darker colors and take the bottom two spots. Therefore, we
correctly assume that the highest line corresponds to Facebook and the
second-highest line to Alphabet.

Figure 20-7. Color-vision deficiency simulation of Figure 20-6. Data source: Yahoo!
Finance. 

Designing Figures Without Legends 

Even though legend legibility can be improved by encoding data redundantly, in
multiple aesthetics, legends always put an extra mental burden on the reader. In reading a
legend, the reader needs to pick up information in one part of the visualization and 
then transfer it over to a different part. We can typically make our readers’ lives easier 
if we eliminate the legend altogether. Eliminating the legend does not mean, however, 
that we simply don’t provide one and instead write sentences such as “The yellow 
dots represent Iris versicolor” in the figure caption. Eliminating the legend means that 
we design the figure in such a way that it is immediately obvious what the various 
graphical elements represent, even if no explicit legend is present. 

The general strategy we can employ is called direct labeling, whereby we incorporate 
appropriate text labels or other visual elements that serve as guideposts to the rest of 
the figure. We have previously encountered direct labeling in Chapter 19 
(Figure 19-2), as an alternative to drawing a legend with over 50 distinct colors. To 
apply the direct labeling concept to the stock price figure, we place the name of each 
company right next to the end of its respective data line (Figure 20-8).

Figure 20-8. Stock price over time for four major tech companies. The stock price for
each company has been normalized to equal 100 in June 2012. Data source: Yahoo! 
Finance. 
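
Placing labels at the line ends raises one practical issue: two lines that end close together produce overlapping labels. The following sketch computes a label anchor at the right end of each line and pushes labels upward where needed to keep a minimum spacing. The function name `end_labels`, the `min_gap` parameter, and the data are all hypothetical; this is one simple way to do it, not the method used for Figure 20-8.

```python
def end_labels(series, x_values, min_gap=10.0):
    """Return {name: (x, y)} label anchors at the right end of each line.

    Each label starts at its line's final value. Working from the lowest line
    upward, a label is raised if it would sit closer than min_gap (in data
    units) to the label below it.
    """
    x_end = x_values[-1]
    ordered = sorted(series, key=lambda name: series[name][-1])  # lowest first
    anchors = {}
    previous_y = None
    for name in ordered:
        y = series[name][-1]
        if previous_y is not None and y - previous_y < min_gap:
            y = previous_y + min_gap
        anchors[name] = (x_end, y)
        previous_y = y
    return anchors

# Hypothetical normalized prices; Apple and Microsoft end close together.
years = [2012, 2013, 2014, 2015]
prices = {
    "Alphabet": [100, 140, 210, 290],
    "Apple": [100, 90, 130, 160],
    "Facebook": [100, 180, 300, 420],
    "Microsoft": [100, 120, 150, 165],
}

for name, (x, y) in end_labels(prices, years).items():
    print(f"{name:10s} x={x} y={y}")
```

In this example the Microsoft label is moved from 165 to 170 so that it clears the Apple label at 160, while the other two labels stay at their line ends.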



Whenever possible, design your figures so they don’t need a separate
legend.


We can also apply the direct labeling concept to the Iris data from the beginning of 
this chapter, specifically Figure 20-3. Because it is a scatterplot of many points that 
separate into three different groups, we need to directly label the groups rather than 
the individual points. One solution is to draw ellipses that enclose the majority of the 
points and then label the ellipses (Figure 20-9). 

For density plots, we can similarly direct-label the curves rather than providing a 
color-coded legend (Figure 20-10). In both Figures 20-9 and 20-10, I have colored the
text labels in the same colors as the data. Colored labels can greatly enhance the direct
labeling effect, but they can also turn out poorly. If the text labels are printed in a
color that is too light, then the labels become difficult to read. And because text
consists of very thin lines, colored text often appears to be lighter than an adjacent filled
area of the same color. I generally circumvent these issues by using two different 
shades of each color, a light one for filled areas and a dark one for lines, outlines, and 
text. If you carefully inspect Figure 20-9 or 20-10, you will see how each data point or 
shaded area is filled with a light color and has an outline drawn in a darker color of 
the same hue. The text labels are drawn in the same darker colors.

Figure 20-9. Sepal width versus sepal length for three different Iris species. The points
representing different Iris species have been directly labeled with colored ellipses and text
labels. Compared to Figure 20-3, I have removed the background grid here because the
figure was becoming too busy. Data source: [Fisher 1936].
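
A light and a dark shade of the same hue can be derived from a single base color. The sketch below does this by changing only the lightness component in HLS space. This is one simple approach and an assumption on my part, not necessarily how the shades in Figures 20-9 and 20-10 were chosen; the lightness values 0.75 and 0.30 and the helper name `with_lightness` are likewise illustrative.

```python
import colorsys

def with_lightness(rgb, lightness):
    """Return an (R, G, B) tuple in 0-255 with the hue and saturation of rgb
    but the given HLS lightness (0 = black, 1 = white)."""
    r, g, b = (c / 255 for c in rgb)
    h, _, s = colorsys.rgb_to_hls(r, g, b)
    return tuple(round(255 * c) for c in colorsys.hls_to_rgb(h, lightness, s))

base = (0, 114, 178)                  # "Blue" from Table 19-1
fill = with_lightness(base, 0.75)     # light shade for filled areas
outline = with_lightness(base, 0.30)  # dark shade for lines, outlines, and text

print("fill:   ", fill)
print("outline:", outline)
```

HLS lightness is not perceptually uniform, so the same lightness value can look lighter for some hues than for others. It is worth checking the resulting pairs by eye, especially for yellow.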

[Figure 20-10 axis label: sepal length]

Figure 20-10. Density estimates of the sepal lengths of three different Iris species. Each 
density estimate is directly labeled with the respective species name. Data source: [Fisher 
1936]. 

We can also use density plots such as the one in Figure 20-10 as a legend replacement, 
by placing the density plots into the margins of a scatterplot (Figure 20-11). This 
allows us to direct-label the marginal density plots rather than the central scatterplot
and hence results in a figure that is somewhat less cluttered than Figure 20-9 with its 
directly labeled ellipses. 



[Figure 20-11 axis label: sepal length]

Figure 20-11. Sepal width versus sepal length for three different Iris species, with 
marginal density estimates of each variable for each species. Data source: [Fisher 1936]. 

And finally, whenever we encode a single variable in multiple aesthetics, we don’t 
normally want multiple separate legends for the different aesthetics. Instead, there 
should be a single legend-like visual element that conveys all the mappings at once. In 
the case where we map the same variable onto a position along a major axis and onto 
color, this implies that the reference color bar should run along and be integrated into 
the same axis. Figure 20-12 shows a case where we map temperature to both a
position along the x axis and to color, and where we therefore have integrated the color
legend into the x axis.

Figure 20-12. Temperatures in Lincoln, NE, in 2016. This figure is a variation of
Figure 9-9. Temperature is now shown both by location along the x axis and by color,
and a color bar along the x axis visualizes the scale that converts temperatures into
colors. Data source: Weather Underground.







CHAPTER 21 


Multipanel Figures 


When datasets become large and complex, they often contain much more
information than can reasonably be shown in a single figure panel. To visualize such datasets,
it can be helpful to create multipanel figures. These are figures that consist of multiple
figure panels where each one shows some subset of the data. There are two distinct
categories of such figures, small multiples and compound figures. Small multiples are
plots consisting of multiple panels arranged in a regular grid. Each panel shows a
different subset of the data but all panels use the same type of visualization. Compound
figures consist of separate figure panels assembled in an arbitrary arrangement 
(which may or may not be grid-based) and showing entirely different visualizations, 
or possibly even different datasets. 

We have encountered both types of multipanel figures in many places throughout this 
book. In general, these figures are intuitive and straightforward to interpret.
However, when preparing such figures, there are a few issues we need to pay attention to,
such as appropriate axis scaling, alignment, and consistency between separate panels. 

Small Multiples 

The term “small multiple” was popularized by [Tufte 1990]. An alternative term,
“trellis plot,” was popularized around the same time by Cleveland, Becker, and colleagues
at Bell Labs ([Cleveland 1993]; [Becker, Cleveland, and Shyu 1996]). Regardless of the
terminology, the key idea is to slice the data into parts according to one or more data
dimensions, visualize each data slice separately, and then arrange the individual
visualizations into a grid. Columns, rows, or individual panels in the grid are labeled by
the values of the data dimensions that define the data slices. More recently, this
technique is also sometimes referred to as “faceting,” named after the methods that create
such plots in the widely used ggplot2 plot library (e.g., the ggplot2 function
facet_grid()) [Wickham 2016].

As a first example, we will apply this technique to the dataset of Titanic passengers.
We can subdivide this dataset by the class in which each passenger traveled and by 
whether a passenger survived or not. Within each of these six slices of data, there are 
both male and female passengers, and we can visualize their numbers using bars. The 
result is six bar plots, which we arrange in two columns (one for passengers who died 
and one for those who survived) of three rows (one for each class) (Figure 21-1). The 
columns and rows are labeled, so it is immediately obvious which of the six plots
corresponds to each combination of survival status and class.


[Figure 21-1 panels: columns died, survived; rows 1st, 2nd, 3rd; bars female, male]


Figure 21-1. Breakdown of passengers on the Titanic by gender, survival, and class in 
which they traveled (1st, 2nd, or 3rd). Data source: Encyclopedia Titanica. 
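
The slicing step behind a figure like this can be sketched independently of any plotting library. The code below groups records into a grid of panels keyed by class (rows) and survival status (columns) and counts male and female passengers within each panel. The records are a handful of made-up examples, not the actual Titanic data, and `facet_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical passenger records: (class, survival status, sex).
passengers = [
    ("1st", "survived", "female"),
    ("1st", "died", "male"),
    ("2nd", "survived", "female"),
    ("3rd", "died", "male"),
    ("3rd", "died", "female"),
    ("3rd", "survived", "male"),
]

def facet_counts(records, rows, cols):
    """Slice records into a rows x cols grid of panels and count sexes in each.

    Every (row, column) combination gets a panel, even if no record falls
    into it, so that the grid stays regular.
    """
    grid = {(r, c): Counter() for r in rows for c in cols}
    for pclass, status, sex in records:
        grid[(pclass, status)][sex] += 1
    return grid

panels = facet_counts(
    passengers,
    rows=["1st", "2nd", "3rd"],
    cols=["died", "survived"],
)

for (pclass, status), counts in panels.items():
    print(pclass, status, dict(counts))
```

Each of the six dictionary entries corresponds to one bar plot in the grid. Creating empty panels explicitly matters: a missing panel would break the regular layout that makes small multiples easy to read.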

This figure provides an intuitive and interpretable visualization of the fate of
Titanic’s passengers. We see immediately that most men died and most women
survived. Further, among the women who died nearly all were traveling in third class.

Small multiples are a powerful tool to visualize very large amounts of data at once. 
Figure 21-1 uses six separate panels, but we can use many more. Figure 21-2 shows 
the relationship between the average ranking of a movie on the Internet Movie
Database (IMDB) and the number of votes the movie has received, separately for movies
released over a 100-year time period. Here, the dataset is sliced by only one
dimension, the year, and panels for each year are arranged in rows from top left to bottom
right. This visualization shows that there is an overall relationship between average 
ranking and number of votes, such that movies with more votes tend to have higher 
rankings. However, the strength of this trend varies with year, and for movies released 
in the early 2000s there is no relationship or even a negative one. 

Figure 21-2. Average movie rankings versus number of votes, for movies released from
1906 to 2005. Blue dots represent individual movies, and orange lines represent the
linear regression of the average ranking of each movie versus the logarithm of the number
of votes the movie has received. In most years, movies with a higher number of votes 
have, on average, a higher average ranking. However, this trend weakened toward the 
end of the 20th century, and a negative relationship can be seen for movies released in 
the early 2000s. Data source: IMDB. 

For such large plots to be easily understandable, it is important that each panel uses
the same axis ranges and scaling. The human mind expects this to be the case. When
it is not, there is a good chance that a reader will misinterpret what the figure shows.
For example, consider Figure 21-3, which presents how the proportion of bachelor’s
degrees awarded in different degree areas has changed over time. The figure shows
each of the nine degree areas that have represented, on average, more than 4% of all
degrees awarded between 1971 and 2015. The y axis of each panel is scaled such that
the curve for each degree field covers the entire y-axis range. As a consequence, a
cursory examination of Figure 21-3 suggests that the nine degree areas are all equally
popular and have all experienced variation in popularity of a similar magnitude.

Figure 21-3. Trends in bachelor’s degrees conferred by US institutions of higher learning.
Shown are all degree areas that represent, on average, more than 4% of all degrees. This
figure is labeled as “bad” because all panels use different y-axis ranges. This choice
obscures the relative sizes of the different degree areas and it exaggerates the changes
that have happened in some of the degree areas. Data source: National Center for
Education Statistics.

Placing all the panels onto the same y axis reveals, however, that this interpretation is
misleading (Figure 21-4). Some degree areas are much more popular than others, and 
similarly some areas have grown or shrunk in popularity much more than others. For 
example, degrees in education have declined sharply, whereas the proportion of visual 
and performing arts degrees awarded has remained approximately constant or maybe 
seen a small increase. 

Figure 21-4. Trends in bachelor’s degrees conferred by US institutions of higher learning.
Shown are all degree areas that represent, on average, more than 4% of all degrees. Data 
source: National Center for Education Statistics. 


I generally recommend against using different axis scalings in separate panels of a 
small multiples plot. However, on occasion, this problem truly cannot be avoided. If 
you encounter such a scenario, then I think at a minimum you need to draw the
reader’s attention to this issue in the figure caption. For example, you could add a
sentence such as: “Notice that the y-axis scalings differ among the different panels of this
figure.” 
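
To make the difference between shared and per-panel axis ranges concrete, here is a small Python sketch. The numbers are invented percentages and the function names are hypothetical. It computes one common range for all panels and, for the case where every panel keeps its own range, produces the warning sentence suggested above.

```python
def shared_range(panels):
    """Return one (low, high) range that covers the values of every panel."""
    all_values = [value for values in panels.values() for value in values]
    return min(all_values), max(all_values)


def per_panel_ranges(panels):
    """Return a separate (low, high) range for each panel."""
    return {name: (min(values), max(values)) for name, values in panels.items()}


def caption_note(ranges):
    """Return a warning sentence if the panels do not all use the same range."""
    if len(set(ranges.values())) > 1:
        return ("Notice that the y-axis scalings differ among the "
                "different panels of this figure.")
    return ""


# Hypothetical proportions of degrees (in percent), invented for illustration.
panels = {
    "Business": [13.5, 21.0, 19.0],
    "Education": [21.0, 9.0, 5.0],
    "Visual and performing arts": [3.5, 4.5, 5.0],
}

common = shared_range(panels)
same_scale = {name: common for name in panels}   # recommended: one range for all
free_scale = per_panel_ranges(panels)            # each panel fills its own range

print(common)
print(repr(caption_note(same_scale)))
print(repr(caption_note(free_scale)))
```

With a common range, the caption needs no special note. With free ranges, the helper returns the warning sentence so that it can be appended to the caption.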

It is also important to think about the ordering of the individual panels in a small
multiples plot. The plot will be easier to interpret if the ordering follows some logical
principle. In Figure 21-1, I arranged the rows from the highest class (first class) to the
lowest class (third class). In Figure 21-2, I arranged the panels by increasing years
from the top left to the bottom right. In Figure 21-4, I arranged the panels by
decreasing average degree popularity, such that the most popular degrees are in the top row
and/or to the left and the least popular degrees are in the bottom row and/or to the
right.
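
As a minimal sketch of one such ordering principle, the following Python snippet sorts panels by decreasing average value and then fills a grid row by row. The data are invented, and the two-column layout is an arbitrary assumption chosen for illustration.

```python
from statistics import mean

# Hypothetical proportions of degrees (in percent), invented for illustration.
series = {
    "Visual and performing arts": [3.5, 4.5, 5.0],
    "Business": [13.5, 21.0, 19.0],
    "Education": [21.0, 9.0, 5.0],
}

# Order panels by decreasing average popularity.
ordered = sorted(series, key=lambda name: mean(series[name]), reverse=True)

# Fill the grid in reading order: left to right, then top to bottom.
n_columns = 2  # assumed layout
grid = [ordered[i:i + n_columns] for i in range(0, len(ordered), n_columns)]

for row in grid:
    print(row)
```

The most popular area ends up in the top-left panel and the least popular in the bottom-right, regardless of the order in which the data were stored.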



Always arrange the panels in a small multiples plot in a meaningful 
and logical order. 


Compound Figures 

Not every figure with multiple panels fits the pattern of small multiples. Sometimes 
we simply want to combine several independent panels into a figure that conveys one 
overarching point. In this case, we can take the individual plots and arrange them in 
rows, columns, or other more complex arrangements, and call the entire arrangement 
one figure. For an example, see Figure 21-5, which continues the analysis of trends in 
bachelor’s degrees conferred by US institutions of higher learning. Panel (a) of 
Figure 21-5 shows the growth in total number of degrees awarded from 1971 to 2015, 
a time span during which the number approximately doubled. Panel (b) instead 
shows the change in the percent of degrees awarded over the same time period in the 
five most popular degree areas. We can see that social sciences, history, and education 
have experienced massive declines from 1971 to 2015, whereas business and health 
professions have seen substantial growth. 

Notice how, unlike in my small multiples examples, the individual panels of the
compound figure are labeled alphabetically. It is conventional to use lower- or uppercase
letters from the Latin alphabet for this labeling, which is needed to uniquely specify a
particular panel. For example, when I want to talk about the part of Figure 21-5
showing the changes in percent of degrees awarded, I can refer to panel (b) of that
figure or simply to Figure 21-5b. Without labeling, I would have to awkwardly talk
about the “right panel” or the “left panel” of Figure 21-5, and referring to specific
panels would be even more awkward for more complex panel arrangements. Labeling
is not needed and not normally done for small multiples because there each panel is
uniquely specified by the faceting variable(s) that are provided as figure labels.

Figure 21-5. Trends in bachelor’s degrees conferred by US institutions of higher learning.
(a) From 1970 to 2015, the total number of degrees awarded nearly doubled. (b) Among
the most popular degree areas, social sciences, history, and education experienced a
major decline, while business and health professions grew in popularity. Data source:
National Center for Education Statistics.


When labeling the different panels of a compound figure, pay attention to how the 
labels fit into the overall figure design. I often see figures where the labels look like 
they were slapped on after the fact by a different person. It’s not uncommon to see 
labels made overly large and prominent, placed in an awkward location, or typeset in 
a different font than the rest of the figure. (See Figure 21-6 for an example.) The 
labels should not be the first thing you see when you look at a compound figure. In 
fact, they don’t need to stand out at all. We generally know which figure panel has 
which label, since the convention is to start in the top-left corner with “a” and label 
consecutively from left to right and top to bottom. I think of these labels as equivalent 
to page numbers. You don’t normally read the page numbers, and there is no surprise 
in which page has which number, but on occasion it can be helpful to use page
numbers to refer to a particular place in a book or article.
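
The labeling convention described above (start in the top-left corner with “a” and proceed left to right, then top to bottom) is simple enough to express in code. The following Python sketch uses a hypothetical helper, `panel_labels`, that assigns a letter to each position in a regular grid of panels.

```python
from string import ascii_lowercase


def panel_labels(n_rows, n_cols, uppercase=False):
    """Map each (row, column) grid position to its label in reading order."""
    if n_rows * n_cols > len(ascii_lowercase):
        raise ValueError("more panels than letters in the Latin alphabet")
    letters = ascii_lowercase.upper() if uppercase else ascii_lowercase
    return {
        (row, col): letters[row * n_cols + col]
        for row in range(n_rows)
        for col in range(n_cols)
    }


labels = panel_labels(2, 2)
for position, label in labels.items():
    print(position, label)
```

Whichever case is chosen, it should be applied consistently to all figures in a document, which is why the sketch exposes the choice as a single parameter.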

We also need to pay attention to how the individual panels of a compound figure fit 
together. It is possible to make a set of figure panels that individually are fine but 
jointly don’t work. In particular, we need to employ a consistent visual language. By 
“visual language,” I mean the colors, symbols, fonts, and so on that we use to display 
the data. Keeping the language consistent means, in a nutshell, that the same things 
look the same or at least substantively similar across figures. 

Figure 21-6. Variation of Figure 21-5 with poor labeling. The panel labels are too large
and thick, they are in the wrong font, and they are placed in an awkward location. Also,
while labeling with capital letters is fine and is in fact quite common, labeling needs to be
consistent across all figures in a document. In this book, the convention is that
multipanel figures use lowercase labels, and thus this figure is inconsistent with the other figures
in this book. Data source: National Center for Education Statistics.

Let’s look at an example that violates this principle. Figure 21-7 is a three-panel figure 
visualizing a dataset about the physiology and body composition of male and female 
athletes. Panel (a) shows the number of men and women in the dataset, panel (b) 
shows the counts of red and white blood cells for men and women, and panel (c) 
shows the body fat percentages of men and women, broken down by sport. Each 
panel individually is an acceptable figure. However, in combination the three panels 
do not work, because they don’t share a common visual language. First, panel (a) uses 
the same blue color for both male and female athletes, panel (b) uses it only for male 
athletes, and panel (c) uses it for female athletes. Moreover, panels (b) and (c)
introduce additional colors, but these colors differ between the two panels. It would have
been better to use the same two colors consistently for male and female athletes, and 
to apply the same coloring scheme to panel (a) as well. Second, in panels (a) and (b) 
women are on the left and men on the right, but in panel (c) the order is reversed. 
The order of the boxplots in panel (c) should be switched so it matches panels (a) 
and (b). 

Figure 21-7. Physiology and body composition of male and female athletes. (a) The
dataset encompasses 73 female and 85 male professional athletes. (b) Male athletes tend to
have higher red blood cell (RBC, reported in units of 10^12 per liter) counts than female
athletes, but there are no such differences for white blood cell counts (WBC, reported in
units of 10^9 per liter). (c) Male athletes tend to have a lower body fat percentage than
female athletes performing in the same sport. This figure is labeled “bad” because parts
(a), (b), and (c) do not use a consistent visual language. Data source: [Telford and
Cunningham 1991].

Figure 21-8 fixes all these issues. In this figure, female athletes are consistently shown
in orange and to the left of male athletes, who are shown in blue. Notice how much
easier it is to read this figure than Figure 21-7. When we use a consistent visual
language, it doesn’t take much mental effort to determine which visual elements in the
different panels represent women and which men. Figure 21-7, on the other hand,
can be quite confusing. In particular, at first glance it may generate the impression
that men tend to have higher body fat percentages than women. Notice also that we
need only a single legend in Figure 21-8 but needed two in Figure 21-7. Since the
visual language is consistent, the same legend works for panels (b) and (c).
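
One practical way to keep the visual language consistent is to define the category order and colors once and reuse that definition in every panel. The Python sketch below illustrates the idea with a hypothetical checker, `find_inconsistencies`. The panel specifications are invented, and the placeholder colors “red” and “green” are assumptions, not the colors used in Figure 21-7.

```python
# One shared definition of the visual language, reused by every panel.
# Orange and blue follow the text; any finer styling would be an assumption.
CATEGORY_ORDER = ["female", "male"]
CATEGORY_COLORS = {"female": "orange", "male": "blue"}


def find_inconsistencies(panels, order, colors):
    """Return messages describing panels that break the shared scheme."""
    problems = []
    for name, spec in panels.items():
        if spec["order"] != order:
            problems.append(
                f"panel ({name}): order {spec['order']} should be {order}"
            )
        for category, color in spec["colors"].items():
            if color != colors[category]:
                problems.append(
                    f"panel ({name}): {category} drawn in {color}, "
                    f"expected {colors[category]}"
                )
    return problems


# Hypothetical panel specifications loosely modeled on an inconsistent figure.
# The colors "red" and "green" are placeholders, not taken from the source.
inconsistent = {
    "a": {"order": ["female", "male"], "colors": {"female": "blue", "male": "blue"}},
    "b": {"order": ["female", "male"], "colors": {"female": "red", "male": "blue"}},
    "c": {"order": ["male", "female"], "colors": {"female": "blue", "male": "green"}},
}
consistent = {
    name: {"order": CATEGORY_ORDER, "colors": CATEGORY_COLORS} for name in "abc"
}

for message in find_inconsistencies(inconsistent, CATEGORY_ORDER, CATEGORY_COLORS):
    print(message)
```

When every panel draws from the same definition, the checker reports nothing, and a single legend is sufficient for the whole figure.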

Figure 21-8. Physiology and body composition of male and female athletes. This figure
shows the exact same data as Figure 21-7, but now uses a consistent visual language. 
Data for female athletes is always shown to the left of the corresponding data for male 
athletes, and genders are consistently color-coded throughout all elements of the figure. 
Data source: [Telford and Cunningham 1991]. 

Finally, we need to pay attention to the alignment of individual figure panels in a 
compound figure. The axes and other graphical elements of the individual panels 
should all be aligned to each other. Getting the alignment right can be quite tricky, in 
particular if individual panels are prepared separately, possibly by different people 
and/or in different programs, and then pasted together in an image manipulation 
program. To draw your attention to such alignment issues, Figure 21-9 shows a
variation of Figure 21-8 where now all figure elements are slightly out of alignment. I have
added axis lines to all panels of Figure 21-9 to emphasize these alignment problems. 
Notice how no axis line is aligned with any other axis line for any other panel of the 
figure. 

Figure 21-9. Variation of Figure 21-8 where all figure panels are slightly misaligned.
Misalignments are ugly and should be avoided. Data source: [Telford and Cunningham 
1991]. 

CHAPTER 22


Titles, Captions, and Tables 


A data visualization is not a piece of art meant to be looked at only for its aesthetically 
pleasing features. Instead, its purpose is to convey information and make a point. To 
reliably achieve this goal when preparing visualizations, we have to place the data into 
context and provide accompanying titles, captions, and other annotations. In this 
chapter, I will discuss how to properly title and label figures. I will also discuss how to 
present data in table form. 

Figure Titles and Captions 

One critical component of every figure is the title. Every figure needs a title. The job 
of the title is to accurately convey to the reader what the figure is about, what point it 
makes. However, the figure title may not necessarily appear where you were expecting 
to see it. Consider Figure 22-1. Its title is “Corruption and human development: the 
most developed countries experience the least corruption.” This title is not shown 
above the figure. Instead, the title is provided as the first part of the caption block, 
underneath the figure display. This is the style I am using throughout this book. I 
consistently show figures without integrated titles and with separate captions. (The
exceptions are the stylized plot examples in Chapter 5, which instead have titles and
no captions.)

[Figure 22-1 x axis: Corruption Perceptions Index, 2015 (100 = least corrupt)]


Figure 22-1. Corruption and human development: the most developed countries
experience the least corruption. Original figure concept: [The Economist online 2011]. Data
sources: Transparency International & UN Human Development Report. 


Alternatively, I could incorporate the figure title—as well as other elements of the 
caption, such as the data source statement—into the main display (Figure 22-2). In a 
direct comparison, you may find Figure 22-2 more attractive than Figure 22-1, and 
you may wonder why I choose to use the latter style throughout this book. I do so 
because the two styles have different application areas, and figures with integrated 
titles are not appropriate for conventional book layouts. The underlying principle is 
that a figure can have only one title. Either the title is integrated into the actual figure 
display or it is provided as the first element of the caption underneath the figure. And 
if a publication is laid out such that each figure has a regular caption block
underneath the display item, then the title must be provided in that block of text. For this
reason, in the context of conventional book or article publishing, we do not normally
integrate titles into figures. Figures with integrated titles, subtitles, and data source
statements are appropriate, however, if they are meant to be used as standalone
infographics or to be posted on social media or on a web page without accompanying
caption text. 

[Figure 22-2 integrated annotations: x axis “Corruption Perceptions Index, 2015 (100 = least corrupt)”; footer “Data sources: Transparency International & UN Human Development Report”]


Figure 22-2. Infographic version of Figure 22-1. The title, subtitle, and data source
statements have been incorporated into the figure. This figure could be posted on the web as
is or otherwise used without a separate caption block. 



If your document layout uses caption blocks underneath each
figure, then place the figure titles as the first element of each caption
block, not on top of the figures. 


One of the most common mistakes I see in figure captions is the omission of a proper 
figure title as the first element of the caption. Take a look back at the caption to 
Figure 22-1. It begins with “Corruption and human development.” It does not begin 
with “This figure shows how corruption is related to human development.” The first 
part of the caption is always the title, not a description of the contents of the figure. A 
title does not have to be a complete sentence, though short sentences making a clear 
assertion can serve as titles. For example, for Figure 22-1, a title such as “The most 
developed countries are the least corrupt” would have worked fine. 

Axis and Legend Titles

Just like every plot needs a title, axes and legends need titles as well. (Axis titles are 
often colloquially referred to as axis labels.) Axis and legend titles and labels explain 
what the displayed data values are and how they map to plot aesthetics. 

To present an example of a plot where all axes and legends are appropriately labeled 
and titled, I have taken the blue jay dataset discussed at length in Chapter 12 and 
visualized it as a bubble plot (Figure 22-3). In this plot, the axis titles indicate that the 
x axis shows body mass in grams and the y axis shows head length in millimeters. 
Similarly, the legend titles show that point coloring indicates the birds’ sex and point 
size indicates the birds’ skull size in millimeters. I emphasize that for all numerical 
variables (body mass, head length, and skull size) the relevant titles not only state the 
variables shown but also the units in which the variables are measured. This is good 
practice and should be done whenever possible. Categorical variables (such as sex) do 
not require units. 
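
The rule that numerical variables carry units while categorical variables do not can be captured in a tiny helper. This Python sketch is only an illustration: the function name `axis_title` is hypothetical, and the abbreviations “g” and “mm” are my shorthand for the grams and millimeters mentioned above.

```python
def axis_title(variable, unit=None):
    """Build an axis or legend title, appending the unit when one is given."""
    return f"{variable} ({unit})" if unit else variable


# Numerical variables state their units; categorical variables do not.
titles = {
    "x": axis_title("body mass", "g"),
    "y": axis_title("head length", "mm"),
    "size": axis_title("skull size", "mm"),
    "color": axis_title("sex"),
}

for aesthetic, title in titles.items():
    print(f"{aesthetic}: {title}")
```

Routing every title through one helper makes it harder to forget a unit and keeps the formatting of units uniform across all figures in a document.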

Figure 22-3. Head length versus body mass for 123 blue jays. The birds’ sex is indicated
by color, and the birds’ skull size by symbol size. Head length measurements include the
length of the bill while skull size measurements do not. Data source: Keith Tarvin,
Oberlin College.


There are cases, however, when axis or legend titles can be omitted, namely when the 
labels themselves are fully explanatory. For example, a legend showing two differently 
colored dots labeled “female” and “male” already indicates that color encodes sex. The
title “sex” is not required to clarify this fact, and indeed throughout this book I have 
often omitted the legend title for legends indicating sex or gender (see e.g., Figures 
6-10, 12-2, or 21-1). Similarly, country names will generally not require a title stating 
what they are (Figure 6-11), nor will movie titles (Figure 6-1) or years (Figure 22-4). 



[Figure 22-4 labels: lines for Facebook, Alphabet, Microsoft, and Apple; x axis ticks from 2013 to 2017, without an axis title.]

Figure 22-4. Stock price over time for four major tech companies. The stock price for 
each company has been normalized to equal 100 in June 2012. This figure is a slightly 
modified version of Figure 20-6 in Chapter 20. Here, the x axis representing time does 
not have a title. It is obvious from the context that the numbers 2013, 2014, etc. refer to 
years. Data source: Yahoo! Finance. 

However, we have to be careful when omitting axis or legend titles, because it is easy 
to misjudge what is and isn’t obvious from the context. I frequently see graphs in the 
popular press that push omitting axis titles to a point that would make me
uncomfortable. For example, some publications might produce a figure such as Figure 22-5,
assuming that the meaning of the axes is obvious from the plot title and subtitle 
(here: “stock price over time for four major tech companies” and “the stock price for 
each company has been normalized to equal 100 in June 2012”). I disagree with the 
perspective that the context defines the axes. Because a caption typically doesn’t 
include words such as “the x/y axis shows,” some amount of guesswork is always 
required to interpret the figure. In my own experience, figures without properly 
labeled axes tend to leave me with a nagging feeling of uncertainty—even if I’m 95% 
certain I understand what is shown, I don’t feel 100% certain. As a general principle, I 
think it is a bad practice to make your readers guess what you mean. Why would you 
want to create a feeling of uncertainty in your readers? 






[Figure 22-5, labeled “bad”: lines for Facebook, Alphabet, Microsoft, and Apple; x axis ticks from 2013 to 2017; neither axis has a title.]

Figure 22-5. Stock price over time for four major tech companies. The stock price for 
each company has been normalized to equal 100 in June 2012. This variant of 
Figure 22-4 has been labeled as “bad” because the y axis now does not have a title either, 
and what the values shown along the y axis represent is not immediately obvious from 
the context. Data source: Yahoo! Finance. 

On the flip side, we can overdo the labeling. If the legend lists the names of four
well-known companies, the legend title “company” is redundant and doesn’t add anything
useful (Figure 22-6). Similarly, even though we generally should report units for all 
quantitative variables, if the x axis shows a few recent years, titling it as “time (years 
AD)” is awkward. 

Finally, in some cases it is acceptable to omit not only the axis title but the entire axis. 
Pie charts typically don’t have explicit axes (e.g., Figure 10-1), and neither do
treemaps (Figure 11-4). Mosaic plots or bar charts can be shown without one or both
axes if the meaning of the plot is otherwise clear (Figures 6-10 and 11-3). Omitting 
explicit axes with axis ticks and tick labels signals to the reader that the qualitative 
features of the graph are more important than the specific data values. 






[Figure 22-6, labeled “ugly”: legend titled “company” listing Facebook, Alphabet, Microsoft, and Apple; x axis ticks from 2013 to 2017, titled “time (years AD)”.]


Figure 22-6. Stock price over time for four major tech companies. The stock price for 
each company has been normalized to equal 100 in June 2012. This variant of 
Figure 22-4 has been labeled as “ugly” because it is labeled excessively. In particular,
providing a unit (“years AD”) for the values along the x axis is awkward and unnecessary.
Data source: Yahoo! Finance. 


Tables 

Tables are an important tool for visualizing data. Yet because of their apparent
simplicity, they may not always receive the attention they deserve. I have shown a
handful of tables throughout this book; for example, Tables 6-1, 7-1, and 19-1. Take a
moment and locate these tables, look at how they are formatted, and compare them
to a table you or a colleague has recently made. In all likelihood, there are important
differences. In my experience, absent proper training in table formatting, few people
will instinctively make the right formatting choices. In self-published documents,
poorly formatted tables are even more prevalent than poorly designed figures. Also,
most software commonly used to create tables provides defaults that are not
recommended. For example, my version of Microsoft Word provides 105 predefined table
styles, and of these at least 70 or 80 violate some of the table rules I’m going to discuss
here. So, if you pick a Microsoft Word table layout at random, you have an
approximately 80% chance of picking one that has issues. And if you pick the default, you
will end up with a poorly formatted table every time.

Some key rules for table layout are the following: 

1. Do not use vertical lines.

2. Do not use horizontal lines between data rows. (Horizontal lines as a separator 
between the title row and the first data row or as a frame for the entire table are 
fine.) 

3. Text columns should be left aligned. 

4. Number columns should be right aligned and should use the same number of 
decimal digits throughout. 

5. Columns containing single characters should be centered. 

6. The header fields should be aligned with their data; i.e., the heading for a text
column will be left aligned and the heading for a number column will be right
aligned.

Figure 22-7 reproduces Table 6-1 in four different ways, two of which (a, b) violate 
several of these rules and two of which (c, d) do not. 


[Figure 22-7: the same table rendered four ways, in panels (a) through (d); panels (a)
and (b) are labeled “ugly.” All four panels contain the following data:]

Rank  Title                                Amount
   1  Star Wars: The Last Jedi        $71,565,498
   2  Jumanji: Welcome to the Jungle  $36,169,328
   3  Pitch Perfect 3                 $19,928,525
   4  The Greatest Showman             $8,805,843
   5  Ferdinand                        $7,316,746


Figure 22-7. Examples of poorly and appropriately formatted tables, using the data from
Table 6-1 in Chapter 6. (a) This table violates numerous conventions of proper table
formatting, including using vertical lines, using horizontal lines between data rows, and
using centered data columns. (b) This table suffers from all problems of (a), and also
creates visual noise by alternating between very dark and very light rows. Also, the table
header is not strongly visually separated from the table body. (c) This is an appropriately
formatted table with a minimal design. (d) Colors can be used effectively to group data
into rows, but the color differences should be subtle. The table header can be set off by
using a stronger color. Data source: Box Office Mojo. Used with permission.

When authors draw tables with horizontal lines between data rows, the intent is
usually to help the eye follow the individual rows. However, unless the table is very wide
and sparse, this visual aid is not normally needed. We don’t draw horizontal lines 
between rows in a piece of regular text either. The cost of horizontal (or vertical) lines 
is visual clutter. Compare parts (a) and (c) of Figure 22-7. Part (c) is much easier to 
read than part (a). If we feel that a visual aid separating table rows is necessary, then 
alternating lighter and darker shading of rows tends to work well without creating 
much clutter (Figure 22-7d). 
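
The layout rules listed earlier translate directly into code. The following Python sketch renders a plain-text table with no vertical lines, a single horizontal rule under the header, left-aligned text columns, right-aligned number columns, and headers aligned with their data. The function `format_table` is a hypothetical helper written for this illustration. Treating the rank column as a number column is my own choice; the rows are those of Table 6-1 as shown in Figure 22-7.

```python
def format_table(header, rows, numeric_columns):
    """Render a plain-text table that follows the basic table layout rules.

    numeric_columns is the set of column indexes that hold numbers; those
    columns (and their headers) are right aligned, all others left aligned.
    """
    cells = [list(header)] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(header))]

    def format_row(row):
        parts = [
            cell.rjust(width) if i in numeric_columns else cell.ljust(width)
            for i, (cell, width) in enumerate(zip(row, widths))
        ]
        return "  ".join(parts).rstrip()

    # The only horizontal line separates the header from the data rows.
    rule = "-" * (sum(widths) + 2 * (len(widths) - 1))
    return "\n".join([format_row(cells[0]), rule] + [format_row(r) for r in cells[1:]])


header = ["Rank", "Title", "Amount"]
rows = [
    (1, "Star Wars: The Last Jedi", "$71,565,498"),
    (2, "Jumanji: Welcome to the Jungle", "$36,169,328"),
    (3, "Pitch Perfect 3", "$19,928,525"),
    (4, "The Greatest Showman", "$8,805,843"),
    (5, "Ferdinand", "$7,316,746"),
]

table = format_table(header, rows, numeric_columns={0, 2})
print(table)
```

The amounts are passed as preformatted strings, so right alignment is enough to line up the digits. With raw numbers, one would also format them with the same number of decimal digits before rendering.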

Finally, there is a key distinction between figures and tables in where the caption is 
located relative to the display item. For figures, it is customary to place the caption 
underneath, whereas for tables it is customary to place it above. This caption
placement is guided by the way in which readers process figures and tables. For figures,
readers tend to first look at the graphical display and then read the caption for
context, hence the caption makes sense below the figure. By contrast, tables tend to be
processed like text, from top to bottom, and reading the table contents before reading 
the caption will frequently not be useful. Hence, captions are placed above the table. 

CHAPTER 23


Balance the Data and the Context 


We can broadly subdivide the graphical elements in any visualization into elements 
that represent data and elements that do not. The former are elements such as the 
points in a scatterplot, the bars in a histogram or bar plot, or the shaded areas in a 
heatmap. The latter are elements such as plot axes, axis ticks and labels, axis titles,
legends, and plot annotations. These elements generally provide context for the data
and/or visual structure to the plot. When designing a plot, it can be helpful to think
about the amount of ink (Chapter 17) used to represent the data and context. A
common recommendation is to reduce the amount of non-data ink, and following this
advice can often yield less cluttered and more elegant visualizations. At the same
time, context and visual structure are important, and overly minimizing the plot
elements that provide them can result in figures that are difficult to read, confusing, or
simply not that compelling. 

Providing the Appropriate Amount of Context 

The idea that distinguishing between data and non-data ink may be useful was
popularized by Edward Tufte in his book The Visual Display of Quantitative Information
[Tufte 2001]. Tufte introduces the concept of the “data-ink ratio,” which he defines as
the “proportion of a graphic’s ink devoted to the non-redundant display of data
information.” He then writes (emphasis mine):

Maximize the data-ink ratio, within reason. 

I have emphasized the phrase “within reason” because it is critical and frequently
forgotten. In fact, I think that Tufte himself forgets it in the remainder of his book,
where he advocates overly minimalistic designs that, in my opinion, are neither
elegant nor easy to decipher. If we interpret the phrase “maximize the data-ink ratio” to
mean “remove clutter and strive for clean and elegant designs,” then I think it is
reasonable advice. But if we interpret it as “do everything you can to remove non-data
ink,” then it will result in poor design choices. If we go too far in either direction we
will end up with ugly figures. However, away from the extremes there is a wide range
of designs that are all acceptable and may be appropriate in different settings.
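
To make the data-ink ratio tangible, here is a rough Python sketch that computes it as data ink divided by total ink. The ink amounts are invented numbers in arbitrary units, and the sketch simply treats all data ink as non-redundant, which is a simplification of Tufte’s definition.

```python
def data_ink_ratio(ink_by_element, data_elements):
    """Return the share of total ink that is spent on data elements."""
    total = sum(ink_by_element.values())
    if total <= 0:
        raise ValueError("a figure must use some ink")
    data = sum(
        amount for name, amount in ink_by_element.items() if name in data_elements
    )
    return data / total


# Hypothetical ink amounts in arbitrary units, invented for illustration.
cluttered = {
    "points": 40,
    "figure frame": 20,
    "panel frame": 15,
    "legend frame": 5,
    "grid": 60,
    "axis text": 10,
}
cleaned = {"points": 40, "grid": 10, "axis text": 10}

for name, figure in [("cluttered", cluttered), ("cleaned", cleaned)]:
    print(f"{name}: {data_ink_ratio(figure, {'points'}):.2f}")
```

A ratio of 1 would mean that axis text and grid have disappeared entirely, which illustrates why the ratio should be maximized only within reason.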

To explore the extremes, let’s consider a figure that has far too much non-data ink 
(Figure 23-1). The colored points in the plot panel (the framed center area containing 
data points) are data ink. Everything else is non-data ink. The non-data ink includes a 
frame around the entire figure, a frame around the plot panel, and a frame around 
the legend. None of these frames are needed. We also see a prominent and dense 
background grid that draws attention away from the actual data points. By removing 
the frames and minor grid lines and by drawing the major grid lines in a light gray, 
we arrive at Figure 23-2. In this version of the figure, the actual data points stand out 
much more clearly, and they are perceived as the most important component of the 
figure. 

Figure 23-1. Percent body fat versus height in professional male Australian athletes.
Each point represents one athlete. This figure devotes way too much ink to non-data.
There are unnecessary frames around the entire figure, around the plot panel, and
around the legend. The coordinate grid is very prominent, and its presence draws
attention away from the data points. Data source: [Telford and Cunningham 1991].

Figure 23-2. Percent body fat versus height in professional male Australian athletes. This
figure is a cleaned-up version of Figure 23-1. Unnecessary frames have been removed, 
minor grid lines have been removed, and major grid lines have been drawn in light gray 
to stand back relative to the data points. Data source: [Telford and Cunningham 1991]. 

At the other extreme, we might end up with a figure such as Figure 23-3, which is a 
minimalist version of Figure 23-2. In this figure, the axis tick labels and titles have 
been made so faint that they are hard to see. If we just glance at the figure we will not 
immediately perceive what data is actually shown. We only see points floating in 
space. Moreover, the legend annotations are so faint that the points in the legend 
could be mistaken for data points. This effect is amplified because there is no visual 
separation between the plot area and the legend. Notice how the background grid in 
Figure 23-2 both anchors the points in space and sets off the data area from the legend
area. Both of these effects have been lost in Figure 23-3.

In Figure 23-2, I am using an open background grid and no axis lines or frame 
around the plot panel. I like this design because it conveys to the viewer that the 
range of possible data values extends beyond the axis limits. For example, even 
though Figure 23-2 shows no athlete taller than 210 cm, such an athlete could conceivably
exist. However, some authors prefer to delineate the extent of the plot panel
by drawing a frame around it (Figure 23-4). Both options are reasonable, and which 
is preferable is primarily a matter of personal opinion. One advantage of the framed 
version is that it visually separates the legend from the plot panel. 

Figure 23-3. Percent body fat versus height in professional male Australian athletes. In 
this example, the concept of removing non-data ink has been taken too far. The axis tick 
labels and title are too faint and are barely visible. The data points seem to float in space. 
The points in the legend are not sufficiently set off from the data points, and the casual 
observer might think they are part of the data. Data source: [Telford and Cunningham 
1991]. 

Figure 23-4. Percent body fat versus height in professional male Australian athletes. This 
figure adds a frame around the plot panel of Figure 23-2, and this frame helps separate 
the legend from the data. Data source: [Telford and Cunningham 1991]. 

Figures with too little non-data ink commonly suffer from the effect that figure elements
appear to float in space, without clear connection or reference to anything. 
This problem tends to be particularly severe in small multiples plots. Figure 23-5 
shows a small multiples plot comparing six different bar plots, but it looks more like a 
piece of modern art than a useful data visualization. The bars are not anchored to a 
baseline and the individual plot facets are not clearly delineated. We can resolve these 
issues by adding a light gray background and thin horizontal grid lines to each facet 
(Figure 23-6). 



Figure 23-5. Survival of passengers on the Titanic, broken down by gender and class. 
This small multiples plot is too minimalistic. The individual facets are not framed, so it’s 
difficult to see which part of the figure belongs to which facet. Further, the individual 
bars are not anchored to a baseline, and they seem to float. Data source: Encyclopedia 
Titanica. 

Figure 23-6. Survival of passengers on the Titanic, broken down by gender and class. 
This is an improved version of Figure 23-5. The gray background in each facet clearly 
delineates the six groupings (survived or died in 1st, 2nd, or 3rd class) that make up this 
plot. Thin horizontal lines in the background provide a reference for the bar heights and 
facilitate comparison of bar heights among facets. Alternatively, we could put a frame 
around each individual plot panel and use gray bars to highlight the grouping variables 
(see Figure 21-1). Data source: Encyclopedia Titanica. 

Background Grids 

Grid lines in the background of a plot can help the reader discern specific data values 
and compare values in one part of a plot to values in another part. At the same time, 
grid lines can add visual noise, in particular when they are prominent or densely 
spaced. Reasonable people can disagree about whether to use a grid or not, and if so 
how to format it and how densely to space it. Throughout this book I use a variety
of grid styles to highlight that there isn’t necessarily one best choice. 

The R software ggplot2 has popularized a style using a fairly prominent background 
grid of white lines on a gray background. Figure 23-7 shows an example in this style. 
The figure displays the change in stock price of four major tech companies over a 
five-year window, from 2012 to 2017. With apologies to the ggplot2 author Hadley 
Wickham, for whom I have the utmost respect, I don’t find the white-on-gray background
grid particularly attractive. To my eye, the gray background can detract from 
the actual data, and a grid with major and minor lines can be too dense. I also find 
the gray squares in the legend confusing. 

Figure 23-7. Stock price over time for four major tech companies. The stock price for 
each company has been normalized to equal 100 in June 2012. This figure mimics the 
ggplot2 default look, with white major and minor grid lines on a gray background. In 
this particular example, I think the grid lines overpower the data lines, and the result is 
a figure that is not well balanced and that doesn’t place sufficient emphasis on the data. 
Data source: Yahoo! Finance. 


Arguments in favor of the gray background include that it both helps the plot to be 
perceived as a single visual entity and prevents the plot from appearing as a white box 
in surrounding dark text [Wickham 2016]. I completely agree with the first point, and 
it was the reason I used gray backgrounds in Figure 23-6. For the second point, I’d 
like to caution that the perceived darkness of text will depend on the font size, font 
face, and line spacing, and the perceived darkness of a figure will depend on the absolute
amount and color of ink used, including all data ink. A scientific paper typeset in 
dense, 10-point Times New Roman will look much darker than a coffee table book 
typeset in 14-point Palatino with one-and-a-half line spacing. Likewise, a scatterplot 
of 5 data points in yellow will look much lighter than a scatterplot of 10,000 data 
points in black. If you want to use a gray figure background, consider the color intensity
of your figure foreground, as well as the expected layout and typography of the 
text around your figures, and adjust the choice of your background gray accordingly. 
Otherwise, your figures may end up standing out as dark boxes 
among the surrounding lighter text. Also, keep in mind that the colors you use to plot 
your data need to work with the gray background. We tend to perceive colors differently
against different backgrounds, and a gray background requires darker and more 
saturated foreground colors than a white background. 

We can go all the way in the opposite direction and remove both the background and 
the grid lines (Figure 23-8). In this case, we need visible axis lines to frame the plot 
and keep it as a single visual unit. For this particular figure, I think this choice is a 
worse option, and I have labeled it as “bad.” In the absence of any background grid 
whatsoever, the curves seem to float in space, and it’s difficult to reference the final 
values on the right to the axis ticks on the left. 

Figure 23-8. Indexed stock price over time for four major tech companies. In this variant 
of Figure 23-7, the data lines are not sufficiently anchored. This makes it difficult to 
ascertain to what extent they have deviated from the index value of 100 at the end of the 
covered time interval. Data source: Yahoo! Finance. 


At the absolute minimum, we need to add one horizontal reference line. Since the 
stock prices in Figure 23-8 are indexed to 100 in June 2012, marking this value with a 
thin horizontal line at y = 100 helps a lot (Figure 23-9). Alternatively, we can use a 
minimal “grid” of horizontal lines. For a plot where we are primarily interested in the 
change in y values, vertical grid lines are not needed. Moreover, grid lines positioned 
at only the major axis ticks will often be sufficient, and the axis line can be omitted or 
made very thin since the horizontal lines mark the extent of the plot (Figure 23-10). 
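
The indexing behind these figures is simple arithmetic, and a short sketch shows why y = 100 is the natural reference line. The prices below are made up for illustration and are not the data behind Figures 23-7 to 23-10. The function name `index_to_base` is likewise an assumption.

```python
def index_to_base(prices, base_position=0, base_value=100.0):
    """Rescale a price series so that prices[base_position] maps to base_value."""
    base_price = prices[base_position]
    if base_price == 0:
        raise ValueError("cannot index to a base price of zero")
    return [base_value * p / base_price for p in prices]


# Made-up prices; the first entry plays the role of the June 2012 value.
prices = [50.0, 55.0, 45.0, 75.0]
indexed = index_to_base(prices)

# Signed distance of each point from the y = 100 reference line.
deviation = [value - 100.0 for value in indexed]
```

Because every series starts at 100, the distance of the final point from the reference line is the percent change over the whole window. This is the quantity a reader tries to judge from the plot, which is why a figure without that line is hard to read.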

Figure 23-9. Indexed stock price over time for four major tech companies. Adding a thin 
horizontal line at the index value of 100 to Figure 23-8 helps provide an important reference
throughout the entire time period the plot spans. Data source: Yahoo! Finance. 

Figure 23-10. Indexed stock price over time for four major tech companies. Adding thin 
horizontal lines at all major y-axis ticks provides a better set of reference points than just 
the one horizontal line of Figure 23-9. This design also removes the need for prominent 
x- and y-axis lines, since the evenly spaced horizontal lines create a visual frame for the 
plot panel. Data source: Yahoo! Finance. 

For such a minimal grid, we generally draw the lines orthogonally to the direction 
along which the numbers of interest vary. Therefore, if instead of plotting the stock 
price over time we plot the five-year increase, as horizontal bars, then we will want to 
use vertical lines instead (Figure 23-11). 





Figure 23-11. Percent increase in stock price from June 2012 to June 2017, for four major 
tech companies. Because the bars run horizontally, vertical grid lines are appropriate 
here. Data source: Yahoo! Finance. 



Grid lines that run perpendicular to the key variable of interest 
tend to be the most useful. 


For bar graphs such as Figure 23-11, Tufte recommends drawing white grid lines on 
top of the bars instead of dark grid lines underneath [Tufte 2001]. These white grid 
lines have the effect of separating the bars into distinct segments of equal length 
(Figure 23-12). I’m of two minds on this style. On the one hand, research into human 
perception suggests that breaking bars into discrete segments helps the viewer to perceive
bar lengths [Haroz, Kosara, and Franconeri 2015]. On the other hand, to my eye 
the bars look like they are falling apart and don’t form a visual unit. In fact, I used this 
style purposefully in Figure 6-10 to visually separate stacked bars representing male 
and female passengers. Which effect dominates may depend on the specific choices of 
bar width, distance between bars, and thickness of the white grid lines. Thus, if you 
intend to use this style, I encourage you to vary these parameters until you have a 
figure that creates the desired visual effect. 

Figure 23-12. Percent increase in stock price from June 2012 to June 2017, for four major 
tech companies. White grid lines on top of bars can help the reader perceive the relative 
lengths of the bars. At the same time, they can also create the perception that the bars 
are falling apart. Data source: Yahoo! Finance. 

I would like to point out another downside of Figure 23-12. I had to move the percentage
values outside the bars, because the labels didn’t fit into the final segments of 
several of the bars. However, this choice visually elongates the bars in a misleading way
and should be avoided whenever possible. 

Background grids along both axis directions are most appropriate for scatterplots 
where there is no primary axis of interest. Figure 23-2 at the beginning of this chapter 
provides an example. When a figure has a full background grid, axis lines are generally
not needed. 

Paired Data 

For figures where the relevant comparison is the x = y line, such as in scatterplots of 
paired data, I prefer to draw a diagonal line rather than a grid. For example, consider 
Figure 23-13, which compares gene expression levels in a mutant virus to the nonmutated
(wild-type) variant. The diagonal line allows us to see immediately which genes 
are expressed higher or lower in the mutant relative to the wild type. The same observation
is much harder to make when the figure has a background grid and no diagonal
line (Figure 23-14). Thus, even though Figure 23-14 looks pleasing, I label it as 
bad. In particular, gene 10A, which has a significantly reduced expression level in the 
mutant relative to the wild-type virus (Figure 23-13), does not visually stand out in 
Figure 23-14. 
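
The comparison that the diagonal line encodes can also be written out numerically. In the sketch below, a point above the x = y line has a positive log fold change and a point below it a negative one. If both axes are logarithmic, this value is proportional to the point's vertical distance from the diagonal. The gene names, TPM values, and the twofold threshold are all made up for illustration and are not taken from the dataset shown in Figure 23-13.

```python
import math


def compare_to_diagonal(pairs, fold_threshold=2.0):
    """Classify paired values (x, y) by their position relative to the x = y line.

    pairs maps a name to (wild_type, mutant); both values must be positive.
    Returns a dict mapping each name to (log2 fold change, label).
    """
    cutoff = math.log2(fold_threshold)
    result = {}
    for name, (wild_type, mutant) in pairs.items():
        log2_fc = math.log2(mutant / wild_type)
        if log2_fc >= cutoff:
            label = "higher"   # clearly above the diagonal
        elif log2_fc <= -cutoff:
            label = "lower"    # clearly below the diagonal
        else:
            label = "similar"  # close to the diagonal
        result[name] = (log2_fc, label)
    return result


# Hypothetical expression levels in TPM: (wild type, mutant).
pairs = {"g1": (1000.0, 1000.0), "g2": (800.0, 100.0), "g3": (50.0, 200.0)}
result = compare_to_diagonal(pairs)
```

A rectangular grid offers no help with this ratio: the reader would have to read off two coordinates per point and compare them mentally. The diagonal line makes the same comparison visible at a glance.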

Figure 23-13. Gene expression levels in a mutant bacteriophage T7 relative to wild type. 
Gene expression levels are measured by mRNA abundances, in transcripts per million 
(TPM). Each dot corresponds to one gene. In the mutant bacteriophage T7, the promoter 
in front of gene 9 was deleted, and this resulted in reduced mRNA abundances of gene 9 
as well as the neighboring genes 8 and 10A (highlighted). Data source: [Paff et al. 2018]. 

Figure 23-14. Gene expression levels in a mutant bacteriophage T7 relative to wild type. 
By plotting this dataset against a background grid instead of a diagonal line, we are 
obscuring which genes are higher or lower in the mutant than in the wild-type bacteriophage.
Data source: [Paff et al. 2018]. 

Of course, we could take the diagonal line from Figure 23-13 and add it on top of the 
background grid of Figure 23-14, to ensure that the relevant visual reference is 
present. However, the resulting figure is getting quite busy (Figure 23-15). I had to 
make the diagonal line darker so it would stand out against the background grid, but 
now the data points almost seem to fade into the background. We could ameliorate 
this issue by making the data points larger or darker, but all considered I’d rather 
choose Figure 23-13. 

Figure 23-15. Gene expression levels in a mutant bacteriophage T7 relative to wild type. 
This figure combines the background grid from Figure 23-14 with the diagonal line from 
Figure 23-13. In my opinion, this figure is visually too busy compared to Figure 23-13, 
and I would prefer Figure 23-13. Data source: [Paff et al. 2018]. 

Summary 

Both overloading a figure with non-data ink and excessively erasing non-data ink can 
result in poor figure design. We need to find a healthy medium, where the data points 
are the main emphasis of the figure while sufficient context is provided about what 
data is shown, where the points lie relative to each other, and what they mean. 

With respect to backgrounds and background grids, there is no one choice that is 
preferable in all contexts. I recommend being judicious about grid lines. Think carefully
about which specific grid or guide lines are most informative for the plot you are 
making, and then only show those. I prefer minimal, light grids on a white background,
since white is the default neutral color on paper and supports nearly any 
foreground color. However, a shaded background can help the plot appear as a single 
visual entity, and this may be particularly useful in small multiples plots. Finally, we 
have to consider how all these choices relate to visual branding and identity. Many 
magazines and websites like to have an immediately recognizable in-house style, and 
a shaded background and specific choice of background grid can help create a unique 
visual identity. 

CHAPTER 24 


Use Larger Axis Labels 


If you take away only one single lesson from this book, make it this one: pay attention 
to your axis labels, axis tick labels, and other assorted plot annotations. Chances are 
they are too small. In my experience, nearly all graphing software and plot libraries 
have poor defaults. If you use the default values, you’re almost certainly making a 
poor choice. 

For example, consider Figure 24-1. I see figures like this all the time. The axis labels, 
axis tick labels, and legend labels are all incredibly small. We can barely see them, and 
we may have to zoom into the page to read the annotations in the legend. 

A somewhat better version of this figure is shown as Figure 24-2. I think the fonts are 
still too small, and that’s why I have labeled the figure as ugly. However, we are moving
in the right direction. This figure might be passable under some circumstances. 
My main criticism here is not so much that the labels aren’t legible as that the figure is 
not balanced; the text elements are too small compared to the rest of the figure. 

Figure 24-1. Percent body fat versus height in professional male Australian athletes. 
(Each point represents one athlete.) This figure suffers from the common affliction that 
the text elements are way too small and are barely legible. Data source: [Telford and 
Cunningham 1991]. 

Figure 24-2. Percent body fat versus height in male athletes. This figure is an improvement
over Figure 24-1, but the text elements remain too small and the figure is not balanced.
Data source: [Telford and Cunningham 1991]. 

Figure 24-3. Percent body fat versus height in male athletes. All figure elements are 
appropriately scaled. Data source: [Telford and Cunningham 1991]. 

Figure 24-3 uses the default settings I’m applying throughout this book. I think it is 
well balanced; the text is legible, and it fits with the overall size of the figure. 

Importantly, we can overdo it and make the labels too big (Figure 24-4). Sometimes 
we need big labels—for example, if the figure is meant to be reduced in size—but the 
various elements of the figure (in particular, label text and plot symbols) need to fit 
together. In Figure 24-4, the points used to visualize the data are too small relative to 
the text. Once we fix this issue, the figure becomes acceptable again (Figure 24-5). 

Figure 24-4. Percent body fat versus height in male athletes. The text elements are fairly 
large, and their size may be appropriate if the figure is meant to be reproduced at a very 
small scale. However, the figure overall is not balanced; the points are too small relative 
to the text elements. Data source: [Telford and Cunningham 1991]. 

Figure 24-5. Percent body fat versus height in male athletes. All figure elements are sized 
such that the figure is balanced and can be reproduced at a small scale. Data source: 
[Telford and Cunningham 1991]. 

You may look at Figure 24-5 and find everything too big. However, keep in mind that 
it is meant to be scaled down. Scale the figure down so that it is only two to three 
inches in width, and it looks just fine. In fact, at that scaling this is the only figure in 
this chapter that looks good. 



Always look at scaled-down versions of your figures to make sure 
the axis labels are appropriately sized. 


I think there is a simple psychological reason for why we routinely make figures 
whose axis labels are too small, and it relates to large, high-resolution computer monitors.
We routinely preview figures on the computer screen, and often we do so while 
the figure takes up a large amount of space on the screen. In this viewing mode, even 
comparatively small text seems perfectly fine and legible, and large text can seem 
awkward and overpowering. In fact, if you take the first figure from this chapter and 
magnify it to the point where it fills your entire screen, you will likely think that it 
looks just fine. The solution is to always make sure that you look at your figures at a 
realistic print size. You can either zoom out so they are only three to five inches in 
width on your screen, or stand well back and check whether the figure still looks 
good from a substantial distance. 
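
The effect of scaling on text size is easy to quantify. The sketch below assumes that a figure is shrunk uniformly, so that text scales by the same factor as the figure width. The function names and the example numbers (a figure designed at 8 inches wide, shown at 4 inches, with an 8-point legibility target) are assumptions chosen for illustration.

```python
def effective_font_size(font_pt, designed_width_in, displayed_width_in):
    """Font size the reader actually sees after uniform scaling of the figure."""
    return font_pt * displayed_width_in / designed_width_in


def required_design_font_size(target_pt, designed_width_in, displayed_width_in):
    """Font size to set at design time so that text appears at target_pt."""
    return target_pt * designed_width_in / displayed_width_in


# A label set at 10 pt in a figure designed 8 in wide, shown 4 in wide.
seen = effective_font_size(10.0, designed_width_in=8.0, displayed_width_in=4.0)

# Design-time size needed for the label to appear at 8 pt after scaling.
needed = required_design_font_size(8.0, designed_width_in=8.0, displayed_width_in=4.0)
```

In this example the 10-point label shrinks to 5 points, and the label would have to be set at 16 points to appear at 8 points. On a large monitor, 16-point text in an 8-inch figure can look oversized, which is exactly the impression that leads us to choose labels that are too small.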

CHAPTER 25 


Avoid Line Drawings 


Whenever possible, visualize your data with solid, colored shapes rather than with 
lines that outline those shapes. Solid shapes are more easily perceived as coherent 
objects, are less likely to create visual artifacts or optical illusions, and more immediately
convey amounts than do outlines. In my experience, visualizations using solid 
shapes are both clearer and more pleasant to look at than equivalent versions that use 
line drawings. Thus, I avoid line drawings as much as possible. However, I want to 
emphasize that this recommendation does not supersede the principle of proportional
ink (Chapter 17). 

Line drawings have a long history in the field of data visualization because throughout
most of the 20th century, scientific visualizations were drawn by hand and had to 
be reproducible in black and white. This precluded the use of areas filled with solid 
colors, including solid grayscale fills. Instead, filled areas were sometimes simulated 
by applying hatch, cross-hatch, or stipple patterns. Early plotting software imitated 
the hand-drawn simulations and similarly made extensive use of line drawings, 
dashed or dotted line patterns, and hatching. While modern visualization tools and 
modern reproduction and publishing platforms have none of the earlier limitations, 
many plotting applications still default to outlines and empty shapes rather than filled 
areas. To raise your awareness of this issue, here I’ll show you several examples of the 
same figures drawn with both lines and filled shapes. 

The most common and at the same time most inappropriate use of line drawings is 
seen in histograms and bar plots. The problem with bars drawn as outlines is that it is 
not immediately apparent which side of any given line is inside a bar and which side 
is outside. As a consequence, in particular when there are gaps between bars, we end 
up with a confusing visual pattern that detracts from the main message of the figure 
(Figure 25-1). Filling the bars with a light color, or with gray if color reproduction is 
not possible, avoids this problem (Figure 25-2). 

Figure 25-1. Histogram of the ages of Titanic passengers, drawn with empty bars. The 
empty bars create a confusing visual pattern. In the center of the histogram, it is difficult 
to tell which parts are inside of bars and which parts are outside. Data source: Encyclopedia
Titanica. 

Figure 25-2. Histogram of the ages of Titanic passengers. This is the same histogram as 
in Figure 25-1, now drawn with filled bars. The shape of the age distribution is much 
more easily discernible in this variation of the figure. Data source: Encyclopedia 
Titanica. 

Next, let’s take a look at an old-school density plot. I’m showing density estimates for 
the sepal length distributions of three species of Iris, drawn entirely in black and 
white as a line drawing (Figure 25-3). The distributions are shown just by their outlines,
and because the figure is in black and white I’m using different line styles to 
distinguish them. This figure has two main problems. First, the dashed line styles do 
not provide a clear separation between the area under the curve and the area above it. 
While the human visual system is quite good at connecting the individual line elements
into a continuous line, the dashed lines nevertheless look porous and do not 
serve as a strong boundary for the enclosed area. Second, because the lines intersect 
and the areas they enclose are not shaded, it is difficult to segment the different densities
from the six distinct shape outlines. This effect would have been even stronger 
had I used solid rather than dashed lines for all three distributions. 



Figure 25-3. Density estimates of the sepal lengths of three different Iris species. The broken
line styles used for Iris versicolor and Iris virginica detract from the perception that 
the areas under the curves are distinct from the areas above them. Data source: [Fisher 
1936]. 

We can attempt to address the problem of porous boundaries by using colored lines 
rather than dashed lines (Figure 25-4). However, the density areas in the resulting 
plot still have little visual presence. Overall, I find the version with filled areas 
(Figure 25-5) the most clear and intuitive. It is important, however, to make the filled 
areas partially transparent, so that the complete distribution for each species is visible. 
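
To see why partial transparency keeps every distribution visible, consider how overlapping fills combine. The sketch below uses the standard "over" compositing rule for a single layer on an opaque background. The colors and the opacity of 0.5 are arbitrary illustrative choices, not the values used in Figure 25-5.

```python
def blend_over(top_rgb, bottom_rgb, alpha):
    """Composite a layer with opacity alpha over an opaque background color."""
    return tuple(alpha * t + (1.0 - alpha) * b for t, b in zip(top_rgb, bottom_rgb))


WHITE = (1.0, 1.0, 1.0)
BLUE = (0.0, 0.0, 1.0)
RED = (1.0, 0.0, 0.0)

# One half-transparent blue area on a white background.
blue_area = blend_over(BLUE, WHITE, alpha=0.5)

# Region where a half-transparent red area overlaps the blue one.
overlap_transparent = blend_over(RED, blue_area, alpha=0.5)

# The same overlap with a fully opaque red fill: the blue area is hidden.
overlap_opaque = blend_over(RED, blue_area, alpha=1.0)
```

With opaque fills, the overlap region takes the color of whichever area is drawn last, so the lower distribution disappears there. With partially transparent fills, the overlap gets a color distinct from either area alone, and the outline of each distribution remains traceable.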

Figure 25-4. Density estimates of the sepal lengths of three different Iris species. By using 
solid, colored lines we have solved the problem of Figure 25-3 that the areas below and 
above the lines seem to be connected. However, we still don’t have a strong sense of the 
size of the area under each curve. Data source: [Fisher 1936]. 



Figure 25-5. Density estimates of the sepal lengths of three different Iris species, shown 
as partially transparent shaded areas. The shading helps us perceive the three density 
curves as three distinct objects. Data source: [Fisher 1936]. 

Line drawings also arise in the context of scatterplots, when different point types are 
drawn as open circles, triangles, crosses, etc. As an example, consider Figure 25-6. 
The figure contains a lot of visual noise, and the different point types do not strongly 
separate from each other. Drawing the same figure with solidly colored shapes 
addresses this issue (Figure 25-7). 

Figure 25-6. City fuel economy versus engine displacement, for cars with front-wheel 
drive (FWD), rear-wheel drive (RWD), and all-wheel drive (4WD). The different point 
styles, all black-and-white line-drawn symbols, create substantial visual noise and make 
it difficult to read the figure. Data source: US Environmental Protection Agency (EPA), 
https://fueleconomy.gov. 

Figure 25-7. City fuel economy versus engine displacement. By using both different colors
and different solid shapes for the different drive-train variants, this figure visually 
separates the drive-train variants while remaining reproducible in grayscale if needed. 
Data source: EPA. 

I strongly prefer solid points over open points, because the solid points have much 
more visual presence. The argument that I sometimes hear in favor of open points is 
that they help with overplotting, since the empty areas in the middle of each point 
allow us to see other points that may be lying underneath. In my opinion, the benefit 
of being able to see overplotted points does not, in general, outweigh the detriment of 
the added visual noise of open symbols. There are other approaches for dealing with 
overplotting; see Chapter 18 for some suggestions. 

Finally, let’s consider boxplots. Boxplots are commonly drawn with empty boxes, as 
in Figure 25-8. I prefer a light shading for the box, as in Figure 25-9. The shading 
separates the box from the plot background, and it helps in particular when we’re 
showing many boxplots right next to each other, as is the case in Figures 25-8 and 
25-9. In Figure 25-8, the large number of boxes and lines can again create the illusion 
of background areas outside of boxes being actually on the inside of some other 
shape, just as we saw in Figure 25-1. This problem is eliminated in Figure 25-9. I have 
sometimes heard the critique that shading the inside of the box gives too much 
weight to the center 50% of the data, but I don’t buy that argument. It is inherent to 
the boxplot, shaded box or not, to give more weight to the center 50% of the data 
than to the rest. If you don’t want this emphasis, then don’t use a boxplot. Instead, use 
a violin plot, jittered points, or a sina plot (Chapter 9). 

Figure 25-8. Distributions of daily mean temperatures in Lincoln, NE, in 2016. Boxes 
are drawn in the traditional way, without shading. Data source: Weather Underground. 

Figure 25-9. Distributions of daily mean temperatures in Lincoln, NE, in 2016. By giving 
the boxes a light gray shading, we can make them stand out better against the background.
Data source: Weather Underground. 

CHAPTER 26 


Don't Go 3D 


3D plots are quite popular, in particular in business presentations but also among 
academics. They are also almost always inappropriately used. It is rare that I see a 3D 
plot that couldn’t be improved by turning it into a regular 2D figure. In this chapter, I 
will explain why 3D plots have problems, why they generally are not needed, and in 
what limited circumstances 3D plots may be appropriate. 

Avoid Gratuitous 3D 

Many visualization tools enable you to spruce up your plots by turning the plots’ 
graphical elements into three-dimensional objects. Most commonly, we see pie charts 
turned into disks rotated in space, bar plots turned into columns, and line plots 
turned into bands. Notably, in none of these cases does the third dimension convey 
any actual data. 3D is used simply to decorate and adorn the plot. I consider this use 
of 3D as gratuitous. It is unequivocally bad and should be erased from the visual 
vocabulary of data scientists. 

The problem with gratuitous 3D is that the projection of 3D objects into two dimensions
for printing or display on a monitor distorts the data. The human visual system 
tries to correct for this distortion as it maps the 2D projection of a 3D image back 
into a 3D space. However, this correction can only ever be partial. As an example, let’s 
take a simple pie chart with two slices, one representing 25% of the data and one 75%, 
and rotate this pie in space (Figure 26-1). As we change the angle at which we’re looking
at the pie, the size of each slice seems to change as well. In particular, the 25% 
slice, which is located in the front of the pie, seems to take up much more than 25% 
of the area when we look at the pie from a flat angle (Figure 26-1a). 
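
A simple geometric model gives a sense of how large this distortion can be. The sketch below treats the pie as a cylinder of radius 1 seen in orthographic projection and computes the share of the visible area (top face plus visible side wall) that belongs to a slice centered at the front. The model, the function name, and the thickness and viewing angles are assumptions made for illustration. They do not describe how Figure 26-1 was rendered, and a perspective projection would change the numbers.

```python
import math


def front_slice_visible_share(slice_fraction, elevation_deg, thickness):
    """Share of the visible pie area taken by a slice centered at the front.

    Assumed model: a cylinder of radius 1 and the given thickness, viewed in
    orthographic projection from elevation_deg above the plane of the pie
    (90 means straight from above). slice_fraction must not exceed 0.5 so
    that the slice's rim lies entirely on the visible front half.
    """
    if not 0.0 < slice_fraction <= 0.5:
        raise ValueError("slice_fraction must be in (0, 0.5]")
    if not 0.0 < elevation_deg <= 90.0:
        raise ValueError("elevation_deg must be in (0, 90]")
    elevation = math.radians(elevation_deg)
    half_angle = math.pi * slice_fraction  # half of the slice's central angle

    # The top face projects to an ellipse. The projection of a flat face is
    # an affine map, so the slice keeps its true share of the top face.
    top_total = math.pi * math.sin(elevation)
    top_slice = slice_fraction * top_total

    # The side wall is visible along the front half of the rim and projects
    # to a band of constant height below the front edge of the ellipse.
    band_height = thickness * math.cos(elevation)
    side_total = 2.0 * band_height
    side_slice = 2.0 * math.sin(half_angle) * band_height

    return (top_slice + side_slice) / (top_total + side_total)


share_from_above = front_slice_visible_share(0.25, elevation_deg=90.0, thickness=0.3)
share_flat_view = front_slice_visible_share(0.25, elevation_deg=20.0, thickness=0.3)
```

Seen from directly above, the slice occupies its true 25% of the visible area. At an elevation of 20 degrees with a thickness of 0.3, the same slice takes up roughly 41% in this model, because the front slice claims most of the visible side wall while the slice at the back gets none of it.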

Figure 26-1. The same 3D pie chart shown from four different angles. Rotating a pie into 
the third dimension makes pie slices in the front appear larger than they really are and 
pie slices in the back appear smaller. Here, in parts (a), (b), and (c), the blue slice corresponding
to 25% of the data visually occupies more than 25% of the area representing 
the pie. Only part (d) is an accurate representation of the data. 

Similar problems arise for other types of 3D plot. Figure 26-2 shows the breakdown 
of Titanic passengers by class and gender using 3D bars. Because of the way the bars 
are arranged relative to the axes, the bars all look shorter than they actually are. For 
example, there were 322 passengers total traveling in first class, yet Figure 26-2 suggests
that the number was less than 300. This illusion arises because the columns representing
the data are located at a distance from the two back surfaces on which the 
gray horizontal lines are drawn. To see this effect, consider extending any of the bottom
edges of one of the columns until it hits the lowest gray line, which represents 0. 
Then, imagine doing the same to any of the top edges, and you’ll see that all the columns
are taller than they appear at first glance. (See Figure 6-10 in Chapter 6 for a 
more reasonable 2D version of this figure.) 
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Figure 26-2. Numbers of female and male passengers on the Titanic traveling in 1st, 2nd, 
and 3rd class, shown as a 3D stacked bar plot. The total numbers of passengers in 1st, 
2nd, and 3rd class are 322, 279, and 711, respectively (see Figure 6-10). Yet in this plot, 
the 1st class bar appears to represent fewer than 300 passengers, the 3rd class bar 
appears to represent fewer than 700 passengers, and the 2nd class bar seems to be closer 
to 210 passengers than the actual 279 passengers. Furthermore, the 3rd class bar visually 
dominates the figure and makes the number of passengers in 3rd class appear larger 
than it actually is. 


Avoid 3D Position Scales 


While visualizations with gratuitous 3D can easily be dismissed as bad, it is less clear
what to think of visualizations using three genuine position scales (x, y, and z) to represent
data. In this case, the use of the third dimension serves an actual purpose. Nevertheless,
the resulting plots are frequently difficult to interpret, and in my mind they
should be avoided.

Consider a 3D scatterplot of fuel efficiency versus displacement and power for 32
cars. We saw this dataset previously in Chapter 2 (Figure 2-5). Here, we plot displacement
along the x axis, power along the y axis, and fuel efficiency along the z axis, and
we represent each car with a dot (Figure 26-3). Even though this 3D visualization is
shown from four different perspectives, it is difficult to envision how exactly the
points are distributed in space. I find part (d) of Figure 26-3 particularly confusing. It
almost seems to show a different dataset, even though nothing has changed other
than the angle from which we look at the dots.

Figure 26-3. Fuel efficiency versus displacement and power for 32 cars (1973-74 models).
Each dot represents one car, and the dot color represents the number of cylinders of
the car. The four panels (a)-(d) show exactly the same data but use different perspectives.
Data source: Motor Trend, 1974.

The fundamental problem with such 3D visualizations is that they require two separate,
successive data transformations. The first transformation maps the data from the
data space into the 3D visualization space, as discussed in Chapters 2 and 3 in the
context of position scales. The second one maps the data from the 3D visualization
space into the 2D space of the final figure. (This second transformation obviously
does not occur for visualizations shown in a true 3D environment, such as when
shown as physical sculptures or 3D-printed objects. My primary objection here is to
3D visualizations shown on 2D displays.) The second transformation is noninvertible,
because each point on the 2D display corresponds to a line of points in the 3D
visualization space. Therefore, we cannot uniquely determine where in 3D space any
particular data point lies.
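
The noninvertibility can be illustrated with a minimal sketch. It assumes a simple
orthographic projection; the function names, the viewing angles, and the example
coordinates are hypothetical and are not taken from the figures in this chapter. Two
different points in the 3D visualization space that lie on the same line of sight land
on the same position in the 2D figure.

```python
import math

def view_direction(azimuth_deg, elevation_deg):
    """Unit vector pointing from the scene toward the viewer."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def project(point, azimuth_deg, elevation_deg):
    """Orthographic projection of a 3D point onto the 2D screen."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    # Two screen axes, both perpendicular to the viewing direction.
    right = (-math.sin(az), math.cos(az), 0.0)
    up = (-math.sin(el) * math.cos(az), -math.sin(el) * math.sin(az), math.cos(el))
    u = sum(p * r for p, r in zip(point, right))
    v = sum(p * w for p, w in zip(point, up))
    return (u, v)

# Hypothetical data point: (displacement, power, fuel efficiency).
car = (160.0, 110.0, 21.0)

# A different 3D point, obtained by sliding along the line of sight.
d = view_direction(30, 20)
other = tuple(c + 50.0 * di for c, di in zip(car, d))

print(project(car, 30, 20))
print(project(other, 30, 20))  # same screen position, up to rounding
```

Because every point along that line of sight yields the same screen coordinates, the
2D position alone cannot tell us which of them was plotted.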

Our visual system nevertheless attempts to invert the 3D to 2D transformation. However,
this process is unreliable, fraught with error, and strongly dependent on appropriate
cues in the image that convey some sense of three-dimensionality. When we
remove these cues, the inversion becomes entirely impossible. This can be seen in
Figure 26-4, which is identical to Figure 26-3 except that all depth cues have been
removed. The result is four random arrangements of points that we cannot interpret
at all and that aren’t even easily relatable to each other. Could you tell which points in
part (a) correspond to which points in part (b)? I certainly cannot.

Instead of applying two separate data transformations, one of which is noninvertible,
I think it is generally better to just apply one appropriate, invertible transformation
and map the data directly into 2D space. It is rarely necessary to add a third dimension
as a position scale, since variables can also be mapped onto color, size, or shape
scales. For example, in Chapter 2, I plotted five variables of the fuel-efficiency dataset
at once yet used only two position scales (Figure 2-5).

Figure 26-4. Fuel efficiency versus displacement and power for 32 cars (1973-74 models).
The four panels (a)-(d) correspond to the same panels in Figure 26-3, but all the
grid lines providing depth cues have been removed. Data source: Motor Trend, 1974.

Here, I want to show two alternative ways of plotting exactly the variables used in
Figure 26-3. First, if we primarily care about fuel efficiency as the response variable,
we can plot it twice, once against displacement and once against power (Figure 26-5).
Second, if we are more interested in how displacement and power relate to each
other, with fuel efficiency as a secondary variable of interest, we can plot power
versus displacement and map fuel efficiency onto the size of the dots (Figure 26-6).
Both figures are more useful and less confusing than Figure 26-3.

Figure 26-5. Fuel efficiency versus displacement (a) and power (b) for 32 cars. Data
source: Motor Trend, 1974.

Figure 26-6. Power versus displacement for 32 cars, with fuel efficiency represented by
dot size. Data source: Motor Trend, 1974.


You may wonder whether the problem with 3D scatterplots is that the actual data representation,
the dots, do not themselves convey any 3D information. What happens,
for example, if we use 3D bars instead? Figure 26-7 shows a typical dataset that one
might visualize with 3D bars, the mortality rates in 1940 Virginia stratified by age 
group and by gender and housing location. We can see that indeed the 3D bars help 
us interpret the plot. It is unlikely that one might mistake a bar in the foreground for 
one in the background, or vice versa. Nevertheless, the problems discussed in the 
context of Figure 26-2 exist here as well. It is difficult to judge exactly how tall the 
individual bars are, and it is also difficult to make direct comparisons. For example, 
was the mortality rate of urban females in the 65-69 age group higher or lower than 
that of urban males in the 60-64 age group? 




Figure 26-7. Mortality rates in Virginia in 1940, visualized as a 3D bar plot. Mortality 
rates are shown for four groups of people (urban and rural females and males) and five 
age categories (50-54, 55-59, 60-64, 65-69, 70-74), and they are reported in units of 
deaths per 1,000 persons. This figure is labeled as “bad” because the 3D perspective 
makes the plot difficult to read. Data source: [Molyneaux, Gilliam, and Florant 1947]. 

In general, it is better to use small multiples plots (Chapter 21) instead of 3D visualizations.
The Virginia mortality dataset requires only four panels when shown as a
small multiples plot (Figure 26-8). I consider this figure clear and easy to interpret. It
is immediately obvious that mortality rates were higher among men than among
women, and also that urban males seem to have had higher mortality rates than rural
males, whereas no such trend is apparent for urban and rural females.

Figure 26-8. Mortality rates in Virginia in 1940, visualized as a small multiples plot.
Mortality rates are shown for four groups of people (urban and rural females and males)
and five age categories (50-54, 55-59, 60-64, 65-69, 70-74), and they are reported in
units of deaths per 1,000 persons. Data source: [Molyneaux, Gilliam, and Florant 1947].

Appropriate Use of 3D Visualizations 

Visualizations using 3D position scales can sometimes be appropriate, however. First, 
the issues described in the preceding section are of lesser concern if the visualization 
is interactive and can be rotated by the viewer, or if it is shown in a VR or augmented 
reality environment where it can be inspected from multiple angles. Second, even if 
the visualization isn’t interactive, showing it slowly rotating, rather than as a static 
image from one perspective, will allow the viewer to discern where in 3D space different
graphical elements reside. The human brain is very good at reconstructing a 3D
scene from a series of images taken from different angles, and the slow rotation of the 
graphic provides exactly these images. 

Finally, it makes sense to use 3D visualizations when we want to show actual 3D
objects and/or data mapped onto them. For example, showing the topographic relief
of a mountainous island is a reasonable choice (Figure 26-9). Similarly, if we want to
visualize the evolutionary sequence conservation of a protein mapped onto its structure,
it makes sense to show the structure as a 3D object (Figure 26-10). In either
case, however, these visualizations would still be easier to interpret if they were shown
as rotating animations. While this is not possible in traditional print publications, it
can be done easily when posting figures on the web or when giving presentations.



Figure 26-9. Relief of the Island of Corsica in the Mediterranean Sea. Data source:
Copernicus Land Monitoring Service.

Figure 26-10. Patterns of evolutionary variation in a protein. The colored tube represents
the backbone of the protein Exonuclease III from the bacterium Escherichia coli. The
coloring indicates the evolutionary conservation of the individual sites in this protein,
with dark coloring indicating conserved amino acids and light coloring indicating variable
amino acids. Data source: [Marcos and Echave 2015].

PART III


Miscellaneous Topics 




CHAPTER 27 


Understanding the Most Commonly Used Image File Formats


Anybody who is making figures for data visualization will eventually have to know a 
few things about how figures are stored on the computer. There are many different 
image file formats, and each has its own set of benefits and disadvantages. Choosing 
the right file format and the right workflow can alleviate many figure preparation 
headaches. 

My own preference is to use PDF for high-quality publication-ready files and generally
whenever possible, PNG for online documents and other scenarios where bitmap
graphics are required, and JPEG as the final resort if the PNG files are too large. In 
the following sections, I explain the key differences between these file formats and 
their respective benefits and drawbacks. 

Bitmap and Vector Graphics 

The most important difference between the various graphics formats is whether they 
are bitmap or vector (Table 27-1). Bitmaps or raster graphics store the image as a grid 
of individual points (called pixels), each with a specified color. By contrast, vector 
graphics store the geometric arrangement of individual graphical elements in the 
image. Thus, a vector image contains information such as “there’s a black line from 
the top-left corner to the bottom-right corner, and a red line from the bottom-left 
corner to the top-right corner,” and the actual image is recreated on the fly as it is 
displayed on screen or printed. 
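
To make the distinction concrete, here is a minimal sketch that stores the two-line image
described above in both ways. The function names are hypothetical, SVG is used as one
example of a vector format, and the bitmap is represented simply as a nested list of RGB
tuples rather than as an actual image file.

```python
def vector_image(size):
    """Describe the image as geometry: two lines in an SVG document."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">'
        f'<line x1="0" y1="0" x2="{size}" y2="{size}" stroke="black"/>'
        f'<line x1="0" y1="{size}" x2="{size}" y2="0" stroke="red"/>'
        "</svg>"
    )

def bitmap_image(size):
    """Describe the same image as a grid of pixels, each with an RGB color."""
    white, black, red = (255, 255, 255), (0, 0, 0), (255, 0, 0)
    grid = [[white] * size for _ in range(size)]
    for i in range(size):
        grid[i][i] = black            # top-left to bottom-right
        grid[size - 1 - i][i] = red   # bottom-left to top-right
    return grid

for size in (8, 800):
    pixels = bitmap_image(size)
    print(size, len(vector_image(size)), len(pixels) * len(pixels[0]))
```

The vector description stays almost the same length regardless of the target size,
whereas the number of stored pixels grows with the square of the image width.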

Table 27-1. Commonly used image file formats

Acronym    Name                               Type     Application
PDF        Portable Document Format           Vector   General purpose
EPS        Encapsulated PostScript            Vector   General purpose, outdated; use PDF
SVG        Scalable Vector Graphics           Vector   Online use
PNG        Portable Network Graphics          Bitmap   Optimized for line drawings
JPEG/JPG   Joint Photographic Experts Group   Bitmap   Optimized for photographic images
TIFF       Tagged Image File Format           Bitmap   Print production, accurate color reproduction
RAW        Raw Image File                     Bitmap   Digital photography, needs post-processing
GIF        Graphics Interchange Format        Bitmap   Outdated for static figures, OK for animations


Vector graphics are also called “resolution-independent,” because they can be magnified
to arbitrary size without losing detail or sharpness. See Figure 27-1 for a
demonstration.




Figure 27-1. Illustration of the key difference between vector graphics and bitmaps. (a)
Original image. The black square indicates the area we are magnifying in parts (b) and
(c). (b) Increasing magnification of the highlighted area from part (a) when the image
has been stored as a bitmap graphic. We can see how the image becomes increasingly
pixelated as we zoom in further. (c) Increasing magnification of a vector representation
of the image. The image maintains perfect sharpness at arbitrary magnification levels.

Vector graphics have two downsides that can and often do cause trouble in real-world
applications. First, because vector graphics are redrawn on the fly by the graphics
program with which they are displayed, it can happen that there are differences in
how the same graphic looks in two different programs, or on two different computers.
This problem occurs most frequently with text, for example when the required
font is not available and the rendering software substitutes a different font. Font
substitutions will typically allow the viewer to read the text as intended, but the resulting
image rarely looks good. There are ways to avoid these problems, such as outlining or
embedding all fonts in a PDF file, but they may require special software and/or special
technical knowledge to achieve. By contrast, bitmap images will always look the
same.

Second, for very large and/or complex figures, vector graphics can grow to enormous 
file sizes and be slow to render. For example, a scatterplot of millions of data points 
will contain the x and y coordinates of every individual point, and each point needs to 
be drawn when the image is rendered, even if points overlap and/or are hidden by 
other graphical elements. As a consequence, the file may be many megabytes in size, 
and it may take the rendering software some time to display the figure. When I was a 
postdoc in the early 2000s, I once created a PDF file that at the time took almost an 
hour to display in Acrobat Reader. While modern computers are much faster and 
rendering times of many minutes are all but unheard of these days, even a rendering 
time of a few seconds can be disruptive if you want to embed your figure into a larger 
document and your PDF reader grinds to a halt every time you display the page with 
that one offending figure. Of course, on the flip side, simple figures with only a small 
number of elements (a few data points and some text, say) will often be much smaller 
as vector graphics than as bitmaps, and the viewing software may even render such 
figures faster than it would the corresponding bitmap images. 

Lossless and Lossy Compression of Bitmap Graphics 

Most bitmap file formats employ some form of data compression to keep file sizes 
manageable. There are two fundamental types of compression: lossless and lossy. 
Lossless compression guarantees that the compressed image is pixel-for-pixel identical
to the original image, whereas lossy compression accepts some image degradation
in return for smaller file sizes. 

To understand when using either lossless or lossy compression is appropriate, it is 
helpful to have a basic understanding of how these different compression algorithms 
work. Let’s first consider lossless compression. Imagine an image with a black background,
where large areas of the image are solid black and thus many black pixels
appear right next to each other. Each black pixel can be represented by three zeros in 
a row, 0 0 0, representing zero intensities in the red, green, and blue color channels 
of the image. The areas of black background in the image correspond to thousands of 
zeros in the image file. Now assume somewhere in the image are 1,000 consecutive 
black pixels, corresponding to 3,000 zeros. Instead of writing out all these zeros, we 
could simply store the total number of zeros we need, for example by writing 3000 0. 
In this way, we have conveyed the exact same information with only two numbers, 
the count (here, 3000) and the value (here, 0). Over the years, many clever tricks 
along these lines have been developed, and modern lossless image formats (such as 
PNG) can store bitmap data with impressive efficiency. However, all lossless
compression algorithms perform best when images have large areas of uniform color,
and therefore Table 27-1 lists PNG as optimized for line drawings.
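
The counting trick described above is commonly known as run-length encoding. The
following is a minimal sketch of the idea only; the function names are hypothetical, and
real formats such as PNG rely on more elaborate schemes than this.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, value])   # start a new run
    return [(count, value) for count, value in runs]

def run_length_decode(runs):
    """Rebuild the original sequence from (count, value) pairs."""
    values = []
    for count, value in runs:
        values.extend([value] * count)
    return values

# 1,000 consecutive black pixels: three zero-valued channels per pixel.
black_run = [0, 0, 0] * 1000
encoded = run_length_encode(black_run)
print(encoded)                                   # [(3000, 0)]
print(run_length_decode(encoded) == black_run)   # True
```

Decoding returns exactly the values that went in, which is what makes the scheme
lossless. For data without long runs, the encoded form can be larger than the original.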

Photographic images rarely have multiple pixels of identical color and brightness 
right next to each other. Instead, they have gradients and other somewhat regular patterns
on many different scales. Therefore, lossless compression of these images often
doesn’t work very well, and lossy compression has been developed as an alternative. 
The key idea of lossy compression is that some details in an image are too subtle for 
the human eye to see, and those can be discarded without obvious degradation in the 
image quality. For example, consider a gradient of 1,000 pixels, each with a slightly 
different color value. Chances are the gradient will look nearly the same if it is drawn 
with only 200 different colors and each group of 5 adjacent pixels is the exact same 
color. 
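
The gradient example can be sketched in a few lines. This is only an illustration of the
principle of discarding detail, with a hypothetical function name and a one-dimensional
grayscale gradient; it is not how JPEG works internally.

```python
def quantize_gradient(pixels, group_size=5):
    """Give every group of adjacent pixels the value of its middle pixel."""
    result = []
    for start in range(0, len(pixels), group_size):
        group = pixels[start:start + group_size]
        representative = group[len(group) // 2]
        result.extend([representative] * len(group))
    return result

gradient = list(range(1000))            # 1,000 slightly different values
quantized = quantize_gradient(gradient)

print(len(set(gradient)), len(set(quantized)))            # 1000 200
print(max(abs(a - b) for a, b in zip(gradient, quantized)))  # 2
```

Each pixel changes by at most two steps, which would be hard to see, but the original
values cannot be recovered from the quantized ones. That loss is what distinguishes
lossy from lossless compression.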

The most widely used lossy image format is JPEG (Table 27-1), and indeed many digital
cameras output images as JPEG by default. JPEG compression works exceptionally
well for photographic images, and huge reductions in file size can often be
obtained with very little degradation in image quality. However, JPEG compression 
fails when images contain sharp edges, such as those created by line drawings or by 
text. In those cases, JPEG compression can result in very noticeable artifacts 
(Figure 27-2). 

Even if JPEG artifacts are sufficiently subtle that they are not immediately visible to 
the naked eye, they can cause trouble, for example in print production. Therefore, it 
is a good idea to avoid the JPEG format whenever possible. In particular, you should 
avoid it for images containing line drawings or text, as is the case for data visualizations
or screenshots. The appropriate format for those images is PNG or TIFF. I use
the JPEG format exclusively for photographic images. If an image contains both photographic
elements and line drawings or text, you should still use PNG or TIFF. The
worst-case scenario with those file formats is that your image files grow large, 
whereas the worst-case scenario with JPEG is that your final product looks ugly. 

Figure 27-2. Illustration of JPEG artifacts. (a) The same image is reproduced multiple
times using increasingly severe JPEG compression. The resulting file size is shown in red
text above each image. A reduction in file size by a factor of 10, from 432 KB in the
original image to 43 KB in the compressed image, results in only a minor perceptible
reduction in image quality. However, a further reduction in file size by a factor of 2, to a
mere 25 KB, leads to numerous visible artifacts. (b) Zooming in to the most highly compressed
image reveals the various compression artifacts. Photo credit: Claus O. Wilke.

Converting Between Image Formats

It is generally possible to convert any image format into any other image format. For 
example, on a Mac, you can open an image with Preview and then export to a number
of different formats. In this process, though, important information can get lost,
and information is never regained. For example, after saving a vector graphic into a 
bitmap format (say, a PDF file as a JPEG), the resolution independence that is a key 
feature of the vector graphic has been lost. Conversely, saving a JPEG image into a 
PDF file does not magically turn the image into a vector graphic. The image will still 
be a bitmap image, just stored inside the PDF file. Similarly, converting a JPEG file 
into a PNG file does not remove any artifacts that may have been introduced by the 
JPEG compression algorithm. 

It is therefore a good rule of thumb to always store the original image in the format 
that maintains maximum resolution, accuracy, and flexibility. Thus, for data visualizations,
either create your figures as PDF and then convert them into PNG or JPEG
when necessary, or store them as high-resolution PNGs. Similarly, for images that are 
only available as bitmaps, such as digital photographs, store them in a format that 
doesn’t use lossy compression—or, if that can’t be done, compress them as little as 
possible. Also, store the images in as high a resolution as possible, and downscale
when needed.

CHAPTER 28


Choosing the Right Visualization Software 


Throughout this book, I have purposefully avoided one critical question of data visualization:
what tools should we use to generate our figures? This question can generate
heated discussions, as many people have strong emotional bonds to the specific
tools they are familiar with. I have often seen people vigorously defend their own preferred
tools instead of investing time in learning a new approach, even if the new
approach has objective benefits. And I will say that sticking with the tools you know 
is not entirely unreasonable. Learning any new tool will require time and effort, and 
you will have to go through a painful transition period where getting things done 
with the new tool is much more difficult than it was with the old tool. Whether going 
through this period is worth the effort can usually only be evaluated in retrospect, 
after one has invested in learning the new tool. Therefore, regardless of the pros and 
cons of different tools and approaches, the overriding principle is that you need to 
pick a tool that works for you. If you can make the figures you want to make, without 
excessive effort, then that’s all that matters. 



The best visualization software is the one that allows you to make 
the figures you need. 


Having said this, I do think there are general principles we can use to assess the relative
merits of different approaches to producing visualizations. These principles
roughly break down by how reproducible the visualizations are, how easy it is to rapidly
explore the data, and to what extent the visual appearance of the output can be
tweaked.

Reproducibility and Repeatability

In the context of scientific experiments, we refer to work as reproducible if the overarching
scientific finding of the work will remain unchanged if a different research
group performs the same type of study. For example, if one research group finds that
a new pain medication reduces perceived headache pain significantly without causing
noticeable side effects and a different group subsequently studies the same medication
on a different patient group and has the same findings, then the work is reproducible.
By contrast, work is repeatable if very similar or identical measurements can be
obtained by the same person repeating the exact same measurement procedure on
the same equipment. For example, if I weigh my dog and find she weighs 41 lbs and
then I weigh her again on the same scales and find again that she weighs 41 lbs, then
this measurement is repeatable.

With minor modifications, we can apply these concepts to data visualization. A visualization
is reproducible if the plotted data is available and any data transformations
that may have been applied before plotting are exactly specified. For example, if you
make a figure and then send me the exact data that you plotted, then I can prepare a
figure that looks substantially similar. We may be using slightly different fonts or colors
or point sizes to display the same data, so the two figures may not be exactly identical,
but your figure and mine convey the same message and therefore are
reproductions of each other. A visualization is repeatable, on the other hand, if it is
possible to recreate the exact same visual appearance, down to the last pixel, from the
raw data. Strictly speaking, repeatability requires that even if there are random elements
in the figure, such as jitter (Chapter 18), those elements were specified in a
repeatable way and can be regenerated at a future date. For random data, repeatability
generally requires that we specify a particular random number generator for which
we set and record a seed.
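
As a minimal sketch of repeatable jitter, assuming Python's standard `random` module
(the function name, the jitter width, and the seed values are hypothetical choices made
for illustration), the seed is passed in explicitly so that it can be recorded alongside
the figure:

```python
import random

def jitter(values, width, seed):
    """Add uniform random offsets in [-width, width], driven by a recorded seed."""
    rng = random.Random(seed)   # a dedicated generator, independent of global state
    return [v + rng.uniform(-width, width) for v in values]

positions = [1.0, 1.0, 2.0, 2.0, 3.0]

first = jitter(positions, width=0.2, seed=42)
repeat = jitter(positions, width=0.2, seed=42)
different = jitter(positions, width=0.2, seed=43)

print(first == repeat)      # the same seed regenerates the same jitter
print(first == different)   # a different seed gives different jitter
```

Two runs with the same seed place every point identically, which corresponds to a
repeat, whereas a run with a different seed shows the same data with different jitter
and is therefore only a reproduction.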

Throughout this book, we have seen many examples of figures that reproduce but
don’t repeat other figures. For example, Chapter 25 shows several sets of figures that
each show the same data but that look somewhat different. Similarly, Figure 28-1a is a
repeat of Figure 9-7, down to the random jitter that was applied to each data point,
whereas Figure 28-1b is only a reproduction of that figure. Figure 28-1b has different
jitter than Figure 9-7, and it also uses a sufficiently different visual design that the two
figures look quite distinct, even if they convey the same information about the data.

Figure 28-1. Repeat and reproduction of a figure. Part (a) is a repeat of Figure 9-7. The
two figures are identical down to the random jitter that was applied to each point. By
contrast, part (b) is a reproduction but not a repeat. In particular, the jitter in part (b)
differs from the jitter in part (a) or in Figure 9-7. Data source: Weather Underground.


Both reproducibility and repeatability can be difficult to achieve when we’re working
with interactive plotting software. Many interactive programs allow you to transform
or otherwise manipulate the data but don’t keep track of every individual data transformation
you perform, only of the final product. If you make a figure using this kind
of program, and then somebody asks you to reproduce the figure or create a similar
one with a different dataset, you might have difficulty doing so. During my years as a
postdoc and a young assistant professor, I used an interactive program for all my scientific
visualizations, and this exact issue came up several times. For example, I had
made several figures for a scientific manuscript. When I wanted to revise the manuscript
a few months later and needed to reproduce a slightly altered version of one of
the figures, I realized that I wasn’t quite sure anymore how I had made the original
figure in the first place. This experience has taught me to stay away from interactive
programs as much as possible. I now make figures programmatically, by writing code
(scripts) that generates the figures from the raw data. Programmatically generated figures
will generally be repeatable by anybody who has access to the generating scripts
and the programming language and specific libraries used.


Data Exploration Versus Data Presentation 

There are two distinct phases of data visualization, and they have very different
requirements. The first is data exploration. Whenever you start working with a new
dataset, you need to look at it from different angles and try various ways of visualizing
it, just to develop an understanding of the dataset’s key features. In this phase,
speed and efficiency are of the essence. You need to try different types of visualizations,
different data transformations, and different subsets of the data. The faster you
can iterate through different ways of looking at the data, the more you will explore,
and the higher the likelihood is that you will notice an important feature in the data
that you might otherwise have overlooked. The second phase is data presentation.
You enter it once you understand your dataset and know what aspects of it you want
to show to your audience. The key objective in this phase is to prepare a high-quality,
publication-ready figure that can be printed in an article or book, included in a presentation,
or posted on the internet.

In the exploration stage, whether the figures you make look appealing is secondary. 
It’s fine if the axis labels are missing, the legend is messed up, or the symbols are too 
small, as long as you can evaluate the various patterns in the data. What is critical, 
however, is how easy it is for you to change how the data is shown. To truly explore 
the data, you should be able to rapidly move from a scatterplot to overlapping density 
distribution plots to boxplots to a heatmap. In Chapter 2, we saw that all visualizations
consist of mappings from data onto aesthetics. A well-designed data exploration
tool will allow you to easily change which variables are mapped onto which aesthetics,
and it will provide a wide range of different visualization options within a single
coherent framework. In my experience, however, many visualization tools (and in 
particular libraries for programmatic figure generation) are not set up in this way. 
Instead, they are organized by plot type, where each different type of plot requires 
somewhat different input data and has its own idiosyncratic interface. Such tools can 
get in the way of efficient data exploration, because it’s difficult to remember how all 
the different plot types work. I encourage you to carefully evaluate whether your visualization
software allows for rapid data exploration or whether it tends to get in the
way. If it more frequently tends to get in the way, you may benefit from exploring 
alternative visualization options. 

Once we have determined how exactly we want to visualize our data, what data transformations
we want to make, and what type of plot to use, we will commonly want to
prepare a high-quality figure for publication. At this point, we have several different
avenues we can pursue. First, we can finalize the figure using the same software platform
we used for initial exploration. Second, we can switch to a platform that provides
us finer control over the final product, even if that platform makes it harder to
explore. Third, we can produce a draft figure with visualization software and then 
manually post-process it with an image manipulation or illustration program such as 
Photoshop or Illustrator. Fourth, we can manually redraw the entire figure from 
scratch, either with pen and paper or using an illustration program. 

All these avenues are reasonable. However, I would like to caution against manually 
sprucing up figures in routine data analysis pipelines or for scientific publications. 
Manual steps in the figure preparation pipeline make repeating or reproducing a figure
inherently difficult and time-consuming. And in my experience from working in
the natural sciences, we rarely make a figure just once. Over the course of a study, we 
may redo experiments, expand the original dataset, or repeat an experiment several 
times with slightly altered conditions. I’ve seen many times that late in the
publication process, when we think everything is done and finalized, we end up
introducing a small modification to how we analyze our data, and consequently all
the figures have to be redrawn. And I’ve also seen, in similar situations, that a decision
is made not to redo the analysis or not to redraw the figures, either due to the
effort involved or because the people who made the original figures have moved on
and aren’t available anymore. In all these scenarios, an unnecessarily complicated and
nonreproducible data visualization pipeline interferes with producing the best possible
science.
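
To make the idea of a fully scripted pipeline concrete, here is a minimal sketch in Python that uses only the standard library. It is an illustration, not the workflow used for this book: the function name bar_chart_svg, its parameters, and the sample measurements are all hypothetical. The point is only that every step from data to figure lives in code, so the figure can be regenerated whenever the data change.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(values, width=320, height=160, bar_color="#4477aa"):
    """Render a dict of {label: non-negative number} as an SVG bar chart.

    No manual touch-up is involved, so rerunning the function on an
    updated dataset regenerates the whole figure.
    """
    if not values:
        raise ValueError("need at least one value to plot")
    top = max(values.values()) or 1  # avoid division by zero if all values are 0
    slot = width / len(values)       # horizontal space reserved for each bar
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height + 20}">'
    ]
    for i, (label, value) in enumerate(values.items()):
        bar_height = height * value / top
        x = i * slot + slot * 0.1
        parts.append(
            f'<rect x="{x:.1f}" y="{height - bar_height:.1f}" '
            f'width="{slot * 0.8:.1f}" height="{bar_height:.1f}" fill="{bar_color}"/>'
        )
        parts.append(
            f'<text x="{x + slot * 0.4:.1f}" y="{height + 15}" '
            f'text-anchor="middle" font-size="11">{escape(str(label))}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical measurements; the second call mimics an expanded dataset.
first_run = bar_chart_svg({"control": 4.0, "treated": 6.5})
second_run = bar_chart_svg({"control": 4.0, "treated": 6.5, "repeat": 7.1})
```

Because the same input always yields the same output, a late change to the analysis only requires rerunning the script, not redrawing anything by hand.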

Having said this, I have no principled concern about hand-drawn figures or figures 
that have been manually post-processed, for example to change axis labels, add
annotations, or modify colors. These approaches can yield beautiful and unique figures
that couldn’t easily be made in any other way. In fact, as sophisticated and polished 
computer-generated visualizations are becoming increasingly commonplace, I 
observe that manually drawn figures are making somewhat of a resurgence (see 
Figure 28-2 for an example). I think this is the case because such figures represent a 
unique and personalized take on what might otherwise be a somewhat sterile and 
routine presentation of data. 


Figure 28-2. After the introduction of next-gen sequencing methods, the sequencing cost 
per genome has declined much more rapidly than predicted by Moore’s law. This
hand-drawn figure reproduces a widely publicized visualization prepared by the National
Institutes of Health. Data source: National Human Genome Research Institute.

Separation of Content and Design

Good visualization software should allow you to think separately about the content 
and the design of your figures. By content, I refer to the specific dataset shown, the 
data transformations applied (if any), the specific mappings from data onto
aesthetics, the scales, the axis ranges, and the type of plot (scatterplot, line plot, bar plot,
boxplot, etc.). Design, on the other hand, describes features such as the foreground 
and background colors, font specifications (e.g., font size, face, and family), symbol 
shapes and sizes, whether or not the figure has a background grid, and the placement 
of legends, axis ticks, axis titles, and plot titles. When I work on a new visualization, I 
usually determine first what the content should be, using the kind of rapid
exploration described in the previous section. Once the content is set, I may tweak the
design, or more likely I will apply a predefined design that I like and/or that gives the 
figure a consistent look in the context of a larger body of work. 

In the software I have used for this book, ggplot2, separation of content and design is 
achieved via themes. A theme specifies the visual appearance of a figure, and it is easy 
to take an existing figure and apply different themes to it (Figure 28-3). Themes can 
be written by third parties and distributed as R packages. Through this mechanism, a 
thriving ecosystem of add-on themes has developed around ggplot2, and it covers a 
wide range of different styles and application scenarios. If you’re making figures with 
ggplot2, you can almost certainly find an existing theme that satisfies your design 
needs. 
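
The separation described above can be sketched in a few lines of Python. This is not ggplot2's API; the class names Content and Theme, the function build_figure_spec, the theme presets, and the data values are all hypothetical and chosen only to show that the same content can be paired with different designs without being touched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """What the figure shows: the data, the axis meaning, and the plot type."""
    data: tuple          # rows of (x, y) pairs
    x_label: str
    y_label: str
    plot_type: str = "line"


@dataclass(frozen=True)
class Theme:
    """How the figure looks: colors, fonts, grid, and legend placement."""
    background: str = "white"
    foreground: str = "black"
    font_family: str = "sans-serif"
    font_size: int = 11
    show_grid: bool = False
    legend_position: str = "right"


def build_figure_spec(content, theme):
    """Pair one content description with one design for a renderer to consume."""
    return {"content": content, "design": theme}


# Hypothetical theme presets, analogous in spirit to reusable themes.
THEMES = {
    "minimal": Theme(),
    "dark_grid": Theme(background="#222222", foreground="#eeeeee", show_grid=True),
}

# Placeholder data, for illustration only.
content = Content(data=((0, 1.0), (1, 2.5), (2, 2.0)), x_label="year", y_label="value")

# The same content rendered under every available design.
figures = {name: build_figure_spec(content, theme) for name, theme in THEMES.items()}
```

Keeping the two descriptions in separate objects means a designer can add or change a theme without knowing anything about the dataset, and an analyst can change the data without touching the look.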

Separation of content and design allows data scientists and designers to each focus on 
what they do best. Most data scientists are not designers, and therefore their primary 
concern should be the data, not the design of a visualization. Likewise, most designers 
are not data scientists, and they should be able to provide a unique and appealing
visual language for figures without having to worry about specific data, appropriate
transformations, and so on. The same principle of separating content and design has 
long been followed in the publishing world of books, magazines, newspapers, and 
websites, where writers provide content but layout and design are handled by a
separate group of people who specialize in this area and who ensure that the publication
appears in a visually consistent and appealing style. This principle is logical and
useful, but it is not yet that widespread in the data visualization world.

In summary, when choosing your visualization software, think about how easily you 
can reproduce figures and redo them with updated or otherwise changed datasets, 
whether you can rapidly explore different visualizations of the same data, and to what 
extent you can tweak the visual design separately from generating the figure content. 
Depending on your skill level and comfort with programming, it may be beneficial to 
use different visualization tools at the data exploration and data presentation stages, 
and you may prefer to do the final visual tweaking interactively or by hand. If you 
have to make figures interactively, in particular with software that does not keep track
of all the data transformations and visual tweaks you have applied, consider taking
careful notes on how you make each figure, so that all your work remains 
reproducible. 




Figure 28-3. Number of unemployed persons in the US from 1970 to 2015. The same 
figure is displayed using four different ggplot2 themes: (a) the default theme for this 
book; (b) the default theme of ggplot2, the plotting software I have used to make all the
figures in this book; (c) a theme that mimics visualizations shown in the Economist; (d)
a theme that mimics visualizations shown by FiveThirtyEight. FiveThirtyEight often
forgoes axis labels in favor of plot titles and subtitles, and therefore I have adjusted the
figure accordingly. Data source: US Bureau of Labor Statistics.

CHAPTER 29


Telling a Story and Making a Point 


Most data visualization is done for the purpose of communication. We have an 
insight about a dataset, and we have a potential audience, and we would like to
convey our insight to our audience. To communicate our insight successfully, we will
have to present the audience with a meaningful and exciting story. The need for a
story may seem disturbing to scientists and engineers, who may equate it with
making things up, putting a spin on things, or overselling results. However, this
perspective misses the important role that stories play in reasoning and memory. We get
excited when we hear a good story, and we get bored when the story is bad or when
there is none. Moreover, any communication creates a story in the audience’s minds.
If we don’t provide a clear story ourselves, then our audience will make one up. In the 
best-case scenario, the story they make up is reasonably close to our own view of the 
material presented. However, it can be and often is much worse. The made-up story 
could be “this is boring,” “the author is wrong,” or “the author is incompetent.” 

Your goal in telling a story should be to use facts and logical reasoning to get your 
audience interested and excited. Let me tell you a story about the theoretical physicist 
Stephen Hawking. He was diagnosed with motor neuron disease at age 21, one year
into his PhD, and was given two years to live. Hawking did not accept this
predicament and started pouring all his energy into doing science. He ended up living to be
76, became one of the most influential physicists of his time, and did all of his seminal 
work while being severely disabled. I’d argue that this is a compelling story. It’s also 
entirely fact-based and true.

What Is a Story?

Before we can discuss strategies for turning visualizations into stories, we need to 
understand what a story actually is. A story is a set of observations, facts, or events, 
true or invented, that are presented in a specific order such that they create an
emotional reaction in the audience. The emotional reaction is created through the
buildup of tension at the beginning of the story followed by some type of resolution 
toward the end of the story. We refer to the flow from tension to resolution as the 
story arc, and every good story has a clear, identifiable arc. 

Experienced writers know that there are standard patterns for storytelling that
resonate with how humans think. For example, we can tell a story using the
Opening-Challenge-Action-Resolution format. In fact, this is the format I used for the
Hawking story. I opened the story by introducing the topic, the physicist Stephen Hawking.
Next I presented the challenge, the diagnosis of motor neuron disease at age 21. Then 
came the action, his fierce dedication to science. Finally I presented the resolution, 
that Hawking led a long and successful life and ended up becoming one of the most 
influential physicists of his time. Other story formats are also commonly used.
Newspaper articles frequently follow the Lead-Development-Resolution format, or, even
shorter, just Lead-Development, where the lead gives away the main point up front 
and the subsequent material provides further details. If we wanted to tell the Hawking 
story in this format, we might start out with a sentence such as “The influential
physicist Stephen Hawking, who revolutionized our understanding of black holes and of
cosmology, outlived his doctors’ prognosis by 53 years and did all of his most
influential work while being severely disabled.” This is the lead. In the development, we
could follow up with a more in-depth description of Hawking’s life, illness, and
devotion to science. Yet another format is
Action-Background-Development-Climax-Ending, which develops the story a little more
rapidly than Opening-Challenge-Action-Resolution but not as rapidly as
Lead-Development. In this format, we might
open with a sentence such as “The young Stephen Hawking, facing a debilitating
disability and the prospect of an early death, decided to pour all his efforts into his
science, determined to make his mark while he still could.” The purpose of this format is
to draw in the audience and to create an emotional connection early on, but without 
immediately giving away the final resolution. 

My goal in this chapter is not to describe these standard forms of storytelling in more 
detail. There are excellent resources that cover this material; for scientists and
analysts, I particularly recommend Joshua Schimel’s book Writing Science [Schimel
2011]. Instead, I want to discuss how we can bring data visualizations into the story 
arc. Most importantly, we need to realize that a single (static) visualization will rarely 
tell an entire story. A visualization may illustrate the opening, the challenge, the 
action, or the resolution, but it is unlikely to convey all these parts of the story at
once. To tell a complete story, we will usually need multiple visualizations. For
example, when giving a presentation, we may first show some background or motivational
material, then a figure that creates a challenge, and eventually some other figure that 
provides the resolution. Likewise, in a research paper, we may present a sequence of 
figures that jointly create a convincing story arc. It is, however, also possible to
condense an entire story arc into a single figure. Such a figure must contain a challenge
and a resolution at the same time, and it is comparable to a story arc that starts with a 
lead. 

To provide a concrete example of incorporating figures into stories, I will now tell a 
story on the basis of two figures. The first creates the challenge and the second serves 
as the resolution. The context of my story is the growth of preprints in the biological 
sciences (see also Chapter 13). Preprints are manuscripts in draft form that scientists 
share with their colleagues before formal peer review and official publication.
Scientists have been sharing manuscript drafts for as long as scientific manuscripts have
existed. However, in the early 1990s, with the advent of the internet, physicists
realized that it was much more efficient to store and distribute manuscript drafts in a
central repository. They invented the preprint server, a web server where scientists 
can upload, download, and search for manuscript drafts. 

The preprint server physicists developed and still use today is called arXiv.org. 
Shortly after it was established, arXiv.org started to branch out and become popular 
in related quantitative fields, including mathematics, astronomy, computer science, 
statistics, quantitative finance, and quantitative biology. Here, I am interested in the 
preprint submissions to the quantitative biology (q-bio) section of arXiv.org. The 
number of submissions per month grew exponentially from 2007 to late 2013, but 
then the growth suddenly stopped (Figure 29-1). Something must have happened in 
late 2013 that radically changed the landscape in preprint submissions for quantitative
biology. What caused this drastic change in submission growth?

Figure 29-1. Growth in monthly submissions to the quantitative biology (q-bio) section
of the preprint server arXiv.org. A sharp transition in the rate of growth can be seen
around 2014. While growth was rapid up to 2014, almost no growth occurred from 2014
to 2018. Note that the y axis is logarithmic, so a linear increase in y corresponds to
exponential growth in preprint submissions. Data source: Jordan Anaya,
http://www.prepubmed.org/.
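
The statement that a straight line on a logarithmic axis means exponential growth can be checked numerically. The following sketch uses a made-up series, not the arXiv data: the starting value of 20 and the growth factor of 1.5 per year are assumptions chosen purely for illustration.

```python
import math

# Hypothetical series: submissions growing by a constant factor each year.
growth_factor = 1.5
submissions = [20 * growth_factor ** year for year in range(6)]

# The position of a value on a log10 axis is the logarithm of that value.
log_positions = [math.log10(n) for n in submissions]

# Differences between consecutive positions on the log axis. For exponential
# growth they all equal log10(growth_factor), so the points fall on a line.
steps = [b - a for a, b in zip(log_positions, log_positions[1:])]
```

Every entry of steps agrees with math.log10(growth_factor) up to floating-point rounding, which is exactly what a constant slope on a logarithmic y axis expresses.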

I will argue that late 2013 marks the point in time when preprints took off in biology, 
and ironically this caused the q-bio archive to slow its growth. In November 2013, the 
biology-specific preprint server bioRxiv was launched by Cold Spring Harbor
Laboratory (CSHL) Press. CSHL Press is a publisher that is highly respected among
biologists. The backing of CSHL Press helped tremendously with the acceptance of
preprints in general and bioRxiv in particular among biologists. The same biologists 
that would have been quite suspicious of arXiv.org were much more comfortable with 
bioRxiv. As a result bioRxiv quickly gained acceptance among biologists, to a degree 
that arXiv had never managed. In fact, soon after its launch, bioRxiv started
experiencing rapid, exponential growth in monthly submissions, and the slowdown in q-bio
submissions exactly coincides with the start of this exponential growth of bioRxiv 
(Figure 29-2). It appears that many quantitative biologists who otherwise
might have deposited a preprint with q-bio decided to deposit it with bioRxiv
instead.

Figure 29-2. The leveling off of submission growth to q-bio coincided with the introduction
of the bioRxiv server. Shown is the growth in monthly submissions to the q-bio
section of the general-purpose preprint server arXiv.org and to the dedicated biology
preprint server bioRxiv. The bioRxiv server went live in November 2013, and its submission
rate has grown exponentially since. It seems likely that many scientists who otherwise 
would have submitted preprints to q-bio chose to submit to bioRxiv instead. Data source: 
Jordan Anaya, http://www.prepubmed.org/. 


This is my story about preprints in biology. I purposefully told it with two figures, 
even though the first (Figure 29-1) is fully contained within the second (Figure 29-2). 
I think this story has the strongest impact when broken into two pieces, and this is 
how I would present it in a talk. However, Figure 29-2 alone can be used to tell the 
entire story, and the single-figure version might be more suitable to a medium where
the audience can be expected to have a short attention span, such as in a social media
post. 

Make a Figure for the Generals 

For the remainder of this chapter, I will discuss strategies for making individual
figures and sets of figures that help your audience to connect with your story and
remain engaged throughout your entire story arc. First, and most importantly, you
need to show your audience figures they can actually understand. It is entirely
possible to follow all the recommendations I have provided throughout this book and still
prepare figures that confuse. When this happens, you may have fallen victim to two
common misconceptions: first, that the audience can see your figures and
immediately infer the points you are trying to make, and second, that the audience can
rapidly process complex visualizations and understand the key trends and
relationships that are shown. Neither of these assumptions is true. We need to do everything
we can to help our readers understand the meaning of our visualizations and see the 
same patterns in the data that we see. This usually means less is more. Simplify your 
figures as much as possible. Remove all features that are tangential to your story. Only 
the important points should remain. I refer to this concept as “making a figure for the 
generals.” 

For several years, I was in charge of a large research project funded by the US Army. 
For our annual progress reports, the program managers instructed me not to
include a lot of figures and told me that any figure I did include should show very clearly
how our project was succeeding. A general, the program managers told me, should be
able to look at each figure and immediately see how what we were doing was
improving upon or exceeding prior capabilities. Yet when my colleagues who were part of
this project sent me figures for the annual progress report, many of the figures did
not meet this criterion. The figures usually were overly complex, were labeled in
confusing, technical terms, or did not make any obvious point at all. Most scientists are
not trained to make figures for the generals. 



Never assume your audience can rapidly process complex visual 
displays. 


Some might hear this story and conclude that the generals are not very smart or just 
not that into science. I think that’s exactly the wrong take-home message. The
generals are simply very busy. They can’t spend 30 minutes trying to decipher a cryptic
figure. When they give millions of dollars of taxpayer funds to scientists to do basic
research, the least they can expect in return is a handful of clear demonstrations that 
something worthwhile and interesting was accomplished. This story should also not 
be misconstrued as being about military funding in particular. The generals are a 
metaphor for anybody you may want to reach with your visualization: a scientific 
reviewer for your paper or grant proposal, a newspaper editor, or your supervisor or 
your supervisor’s boss at the company where you’re working. If you want your story 
to come across, you need to make figures that are appropriate for your generals. 

The first thing that will get in the way of making a figure for the generals is, ironically, 
the ease with which modern visualization software allows us to make sophisticated 
data visualizations. With nearly limitless power of visualization, it becomes tempting 
to keep piling on more dimensions of data. And in fact, I see a trend in the world of 
data visualization to make the most complex, multifaceted visualizations possible. 
These visualizations may look very impressive, but they are unlikely to convey a 
meaningful story. Consider Figure 29-3, which shows the arrival delays for all flights
departing out of the New York City area in 2013. I suspect it will take you a while to
process this figure.

Figure 29-3. Mean arrival delay versus distance from New York City. Each point represents
one destination, and the size of each point represents the number of flights from
one of the three major New York City airports (Newark, JFK, or LaGuardia) to that
destination in 2013. Negative delays imply that the flight arrived early. Solid lines represent
the mean trends between arrival delay and distance. Delta has consistently lower arrival
delays than other airlines, regardless of distance traveled. American has among the
lowest delays, on average, for short distances, but has among the highest delays for longer
distances traveled. This figure is labeled as “bad” because it is overly complex. Most
readers will find it confusing and will not intuitively grasp what the figure is showing.
Data source: US Dept. of Transportation, Bureau of Transportation Statistics.

I think the most important feature of Figure 29-3 is that American and Delta have the 
shortest arrival delays. This insight is much better conveyed in a simple bar graph 
(Figure 29-4). Therefore, Figure 29-4 is the correct figure to show if the story is about 
arrival delays of airlines, even if making that graph doesn’t challenge your data
visualization skills. And if you’re then wondering whether these airlines have small delays
because they don’t fly that much out of the New York City area, you could present a 
second bar graph highlighting that both American and Delta are major carriers in 
this area (Figure 29-5). Both of these bar graphs discard the distance variable shown 
in Figure 29-3. This is OK. We don’t need to visualize data dimensions that are
tangential to our story, even if we have them and even if we could make a figure that
showed them. Simple and clear is better than complex and confusing. When you’re
trying to show too much data at once, you may end up not showing anything.

Figure 29-4. Mean arrival delay for flights out of the New York City area in 2013, by
airline. American and Delta have the lowest mean arrival delays of all airlines flying out
of the New York City area. Data source: US Dept. of Transportation, Bureau of
Transportation Statistics.

Figure 29-5. Number of flights out of the New York City area in 2013, by airline. Delta
and American are the fourth and fifth largest carriers by flights out of the New York City
area. Data source: US Dept. of Transportation, Bureau of Transportation Statistics.
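
The simplification from the scatterplot to the two bar graphs amounts to a small aggregation step that drops the distance variable. A minimal sketch of that step is shown below. The flight records are invented placeholders, not the Bureau of Transportation Statistics data, and the function name summarize_by_airline is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flight records: (airline, distance in miles, arrival delay in minutes).
flights = [
    ("Airline A", 200, 12.0),
    ("Airline A", 2400, 4.0),
    ("Airline B", 750, -3.0),
    ("Airline B", 1100, 5.0),
    ("Airline C", 500, 20.0),
]


def summarize_by_airline(records):
    """Collapse per-flight records to one mean delay and one flight count per airline."""
    delays = defaultdict(list)
    for airline, _distance, delay in records:  # distance is deliberately ignored
        delays[airline].append(delay)
    return {
        airline: {"mean_delay": mean(values), "n_flights": len(values)}
        for airline, values in delays.items()
    }


summary = summarize_by_airline(flights)

# Airlines ordered from shortest to longest mean delay, ready for a bar graph.
bars = sorted(summary.items(), key=lambda item: item[1]["mean_delay"])
```

The mean_delay values would feed a bar graph of delays and the n_flights values a second bar graph of carrier size, each making one point.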

Build Up Toward Complex Figures

Sometimes, however, we do want to show more complex figures that contain a large 
amount of information at once. In those cases, we can make things easier for our 
readers if we first show them a simplified version of the figure before we show the 
final one in its full complexity. The same approach is also strongly recommended for 
presentations. Never jump straight to a highly complex figure; first show an easily 
digestible subset of the information. 

This recommendation is particularly relevant if the final figure is a small multiples 
plot (Chapter 21) showing a grid of subplots with similar structure. The full grid is 
much easier to digest if the audience has first seen a single subplot by itself. For
example, Figure 29-6 shows the aggregate numbers of United Airlines departures out of
Newark Airport (EWR) in 2013, broken down by weekday. Once we have seen and 
digested this figure, it’s much easier to process the same information for 10 airlines 
and 3 airports at once (Figure 29-7). 



Figure 29-6. United Airlines departures out of Newark Airport (EWR) in 2013, by
weekday. Most weekdays show approximately the same number of departures, but there are
fewer departures on weekends. Data source: US Dept. of Transportation, Bureau of
Transportation Statistics.

Make Your Figures Memorable 

Simple and clean figures such as simple bar plots have the advantage that they avoid 
distractions, are easy to read, and let your audience focus on the most important 
points you want to bring across. However, the simplicity can come with a
disadvantage: figures can end up looking generic. They don’t have any features that stand out
and make them memorable. If I showed you 10 bar graphs in quick succession you’d 
have a hard time keeping them apart and afterwards remembering what you saw. For 
example, if you take a quick look at Figure 29-8, you will notice the visual similarity 
to Figure 29-5 from earlier in this chapter. However, the two figures have nothing in 
common other than that they are bar graphs. Figure 29-5 showed the number of 
flights out of the New York City area by airline, whereas Figure 29-8 shows the most 
popular pets in US households. Neither figure has any element that helps you
intuitively perceive what topic the figure covers, and therefore neither figure is
particularly memorable.

Figure 29-8. Number of households having one or more of the most popular pets: dogs,
cats, fish, or birds. This bar graph is perfectly clear but not necessarily particularly
memorable. The “cats” column has been highlighted solely to create visual similarity with
Figure 29-5. Data source: 2012 US Pet Ownership & Demographics Sourcebook,
American Veterinary Medical Association.

Research on human perception shows that more visually complex and unique figures 
are more memorable [Bateman et al. 2010]; [Borgo et al. 2012]. However, visual 
uniqueness and complexity do not just affect memorability, as they may hinder a
person’s ability to get a quick overview of the information or make it difficult to
distinguish small differences in values. At the extreme, a figure could be very memorable
but utterly confusing. Such a figure would not be a good data visualization, even if it 
works well as a stunning piece of art. At the other extreme, figures may be very clear 
but forgettable and boring, and those figures may not have the impact we might hope
for either. In general, we want to strike a balance between the two extremes and make
our figures both memorable and clear. (The intended audience matters as well,
however. If a figure is intended for a technical scientific publication, we will generally
worry less about memorability than if the figure is intended for a broadly read
newspaper or blog.)

We can make a figure more memorable by adding visual elements that reflect features 
of the data, such as drawings or pictograms of the things or objects that the dataset is 
about. One approach that is commonly taken is to show the data values themselves in 
the form of repeated images, such that each copy of an image corresponds to a 
defined amount of the represented variable. For example, we can replace the bars in 
Figure 29-8 with repeated images of a dog, a cat, a fish, and a bird, drawn to a scale 
such that each complete animal corresponds to 5 million households (Figure 29-9). 
Thus, visually, Figure 29-9 still functions as a bar plot, but we now have added some 
visual complexity that makes the figure more memorable, and we have also shown 
the data using images that directly reflect what the data means. After only a quick 
glance at the figure, you may be able to remember that there were many more dogs 
and cats than fish or birds. Importantly, in such visualizations, we want to use the 
images to represent the data, rather than using images simply to adorn the
visualization or to annotate the axes. In psychological experiments, the latter choices tend to
be distracting rather than helpful [Haroz, Kosara, and Franconeri 2015].

Figure 29-9. Number of households having one or more of the most popular pets, shown
as an isotype plot. Each complete animal represents 5 million households that have that
kind of pet. Data source: 2012 US Pet Ownership & Demographics Sourcebook,
American Veterinary Medical Association.

Visualizations such as Figure 29-9 are often called isotype plots. The word “isotype”
was introduced as an acronym of International System Of Typographic Picture
Education, and strictly speaking it refers to logo-like simplified pictograms that represent
objects, animals, plants, or people [Haroz, Kosara, and Franconeri 2015]. However, I 
think it makes sense to use the term isotype plot more broadly to apply to any type of 
visualization where repeated copies of the same image are used to indicate the
magnitude of a value. After all, the prefix “iso” means “the same” and “type” can mean a
particular kind, class, or group. 
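
The encoding behind an isotype plot is simple arithmetic: divide each value by the amount one image stands for. The sketch below assumes the scale of 5 million households per complete image used in Figure 29-9. The function name icon_count and the household counts are hypothetical and are not taken from the sourcebook.

```python
HOUSEHOLDS_PER_ICON = 5_000_000  # scale used in Figure 29-9


def icon_count(households, per_icon=HOUSEHOLDS_PER_ICON):
    """Return (number of complete images, fraction of one further image)."""
    whole, remainder = divmod(households, per_icon)
    return int(whole), remainder / per_icon


# Hypothetical household counts, for illustration only.
pets = {"pet A": 42_500_000, "pet B": 6_000_000}

for name, households in pets.items():
    whole, partial = icon_count(households)
    print(f"{name}: {whole} complete images plus {partial:.1f} of another")
```

The fractional part is what gets drawn as a partially clipped image at the end of each row, so the total length of a row remains proportional to the value, just as in a bar plot.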

Be Consistent but Don't Be Repetitive 

When discussing compound figures in Chapter 21, I mentioned that it is important to
use a consistent visual language for the different parts of a larger figure. The same is 
true across figures. If we make three figures that are all part of one larger story, then 
we need to design those figures so they look like they belong together. Using a
consistent visual language does not mean, however, that everything should look exactly the
same. On the contrary, it is important that figures describing different analyses look 
visually distinct, so that your audience can easily recognize where one analysis ends 
and another one starts. This is best achieved by using different visualization 
approaches for different parts of the overarching story. If you have used a bar plot 
already, next use a scatterplot, or a boxplot, or a line plot. Otherwise, the different 
analyses will blur together in your audience’s mind, and your audience will have a
hard time distinguishing one part of the story from another. For example, if we
redesign Figure 21-8 from “Compound Figures” on page 260 so it uses only bar plots, the
result is noticeably less distinct and more confusing (Figure 29-10). 

When preparing a presentation or report, aim to use a different 
type of visualization for each distinct analysis.

Figure 29-10. Physiology and body composition of male and female athletes. Error bars
indicate the standard error of the mean. This figure is overly repetitive. It shows the same
data as Figure 21-8 and it uses a consistent visual language, but all the subfigures use
the same type of visualization (bar plots). This makes it difficult for the reader to process
that parts (a), (b), and (c) show entirely different results. Data source: [Telford and
Cunningham 1991].


Sets of repetitive figures are often a consequence of multipart stories where each part 
is based on the same type of raw data. In those scenarios, it can be tempting to use the 
same type of visualization for each part. However, in aggregate, these figures will not 
hold the audience’s attention. As an example, let’s consider a story about the price of 
Facebook stock, in two parts: (i) the Facebook stock price has increased rapidly from 
2012 to 2017, and (ii) the price increase has outpaced that of other large tech
companies. You might want to visualize these two statements with two figures showing stock
price over time, as demonstrated in Figure 29-11. However, while Figure 29-11a
serves a purpose and should remain as is, Figure 29-11b is repetitive and at the same
time obscures the main point. We don’t particularly care about the exact temporal
evolution of the stock price of Alphabet, Apple, or Microsoft; we just want to highlight
that it grew less than the stock price of Facebook. 
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Figure 29-11. Growth of Facebook stock price over a five-year interval and comparison 
with other tech stocks, (a) The Facebook stock price rose from around $25/'share in 
mid-2012 to $150/share in mid-2017. (b) The prices of other large tech companies did 
not rise comparably over the same time period. Prices have been indexed to 100 on June 
1, 2012 to allow for easy comparison. This figure is labeled as “ugly” because parts (a) 
and (b) are repetitive. Data source: Yahoo! Finance. 


I recommend leaving part (a) as is but replacing part (b) with a bar plot showing 
percent increase (Figure 29-12). Now we have two distinct figures that each make a 
unique point and that work well in combination. Part (a) allows the reader to get 
familiar with the raw, underlying data and part (b) highlights the magnitude of the 
effect while removing any tangential information. 

Figure 29-12. Growth of Facebook stock price over a five-year interval and comparison 
with other tech stocks. (a) The Facebook stock price rose from around $25/share in 
mid-2012 to $150/share in mid-2017, an increase of almost 450%. (b) The prices of 
other large tech companies did not rise comparably over the same time period. Price 
increases ranged from around 90% to almost 240%. Data source: Yahoo! Finance. 


Figure 29-12 highlights a general principle that I follow when preparing sets of figures 
to tell a story: I start with a figure that is as close as possible to showing the raw 
data, and in subsequent figures I show increasingly more derived quantities. Derived 
quantities (such as percent increases, averages, coefficients of fitted models, and so 
on) are useful to summarize key trends in large and complex datasets. However, 
because they are derived they are less intuitive, and if we show a derived quantity 
before we have shown the raw data, our audience will find it difficult to follow. On the 
flip side, if we try to show all trends by showing raw data, we will end up needing too 
many figures and/or being repetitive. 
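
The step from raw data to derived quantities can be sketched in a few lines of code. The following Python example is only an illustration under stated assumptions: the company names and prices are made up and are not the Yahoo! Finance data used in the figures, and the helper names index_to_base and percent_increase are hypothetical. It computes an indexed series (first value set to 100, as in the comparison of stock prices) and a single percent increase per series (the kind of quantity a summary bar plot would show).

```python
def index_to_base(prices, base=100.0):
    """Rescale a price series so that its first value equals `base`."""
    first = prices[0]
    return [base * p / first for p in prices]


def percent_increase(prices):
    """Percent change from the first to the last value of a series."""
    return 100.0 * (prices[-1] - prices[0]) / prices[0]


# Made-up illustrative prices (NOT real stock data).
prices = {
    "Company A": [20.0, 35.0, 60.0, 110.0],
    "Company B": [50.0, 60.0, 80.0, 95.0],
}

# Closer to the raw data: one indexed time series per company.
indexed = {name: index_to_base(series) for name, series in prices.items()}

# More derived: a single number per company.
increase = {name: percent_increase(series) for name, series in prices.items()}

# Print the derived quantity, largest increase first.
for name in sorted(increase, key=increase.get, reverse=True):
    print(f"{name}: {increase[name]:.0f}% increase")
```

Note that the percent increase is simply the last indexed value minus 100, which is one reason the derived figure is easier to follow once the audience has seen the time series it was computed from.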

How many figures should you use to tell your story? The answer depends on the publication 
venue. For a short blog post or tweet, make one figure. For scientific papers, I 
recommend between three and six figures. If there are many more than six figures for 
a scientific paper, then some of them may need to be moved into an appendix or supplementary 
materials section. It is good to document all the evidence we have collected, 
but we must not wear out our audience by presenting excessive numbers of 
mostly similar-looking figures. In other contexts, a larger number of figures may be 
appropriate. However, in those contexts, we will usually be telling multiple stories, or 
an overarching story with subplots. For example, if I am asked to give an hour-long 
scientific presentation, I usually aim to tell three distinct stories. Similarly, a book or 
thesis will contain more than one story, and in fact may contain one story per chapter 
or section. In those scenarios, each distinct story line or subplot should be presented 
with no more than three to six figures. In this book, you will find that I follow this 
principle at the level of sections within chapters. Each section is approximately self-contained 
and will typically show no more than six figures. 


Annotated Bibliography 


No single book can cover everything there is to know about a topic. I encourage you 
to read other texts on data visualization to deepen your understanding and to develop 
your technical skills in making figures. Here, I provide a limited selection of books 
that I have personally found interesting, thought-provoking, or helpful. Books listed 
in the first section are the most similar in scope to the present book, and may provide 
complementary or alternative perspectives on the topics I have covered. Books listed 
in “Programming Books” on page 352 address the important topic of how to make 
visualizations using programming approaches and available software libraries. The 
remaining sections list other books that will expand your knowledge of data visualization 
and help you communicate with visuals and data. 

Thinking About Data and Visualization 

The following books discuss the thought processes and decision making required for 
turning data into visualizations. They serve as introductory texts on how to choose 
what visualizations to make and what pitfalls to look out for: 

Alberto Cairo. The Truthful Art. New Riders, 2016. 

An excellent all-around introduction to data visualization, in particular for journalists. 
The book covers many important concepts of data visualization, such as 
how to visualize distributions, trends, uncertainty, and maps. In many chapters, it 
also serves as an introduction to basic statistical principles, explaining concepts 
such as populations, samples, and confidence levels. 

Stephen Few. Show Me the Numbers. Analytics Press, 2012. 

A book about data visualization for the business professional. It is similar in 
scope and target audience to the following reference but contains more material 
and covers many topics in more depth. However, it is not as well written or carefully 
produced as the following book. 

Cole Nussbaumer Knaflic. Storytelling with Data. John Wiley & Sons, 2015. 

A well-written and carefully produced book on how to turn data into visuals. The 
book’s primary audience is people making business graphics, and it’s an excellent 
reference for the topics it covers. However, it does not cover many topics of 
importance to scientists, such as the visualization of distributions, trends, or 
uncertainty. 

Programming Books 

The following references are all how-to books that teach programming approaches to 
data visualization: 

Kieran Healy. Data Visualization: A Practical Introduction. Princeton University Press, 
2018. 

An introduction to using ggplot2 for data visualization. Recommended as a follow-up 
to Wickham and Grolemund’s R for Data Science (mentioned later in this 
list). 

Scott Murray. Interactive Data Visualization for the Web: An Introduction to Designing 
with D3. 2nd ed. O’Reilly Media, 2017. 

An introduction to making interactive online visualizations with D3, using 
HTML, CSS, JavaScript, and SVG. 

Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with 
Data. O’Reilly Media, 2016. 

An introduction to using the programming language Python for data science. 
Has extensive material on data visualization using Python’s Matplotlib and 
Seaborn. 

Hadley Wickham, Garrett Grolemund. R for Data Science. O’Reilly Media, 2017. 

An all-around introduction to using the programming language R for data science. 
Contains several chapters on using ggplot2 for data visualization. 

Statistics Texts 

Introductory texts in statistics will generally contain material on data visualization, 
covering topics such as scatterplots, histograms, boxplots, and line graphs. There are 
many such texts that could be listed. Here, I mention just a few recent additions that 
are worth a look: 

David M. Diez, Christopher D. Barr, Mine Çetinkaya-Rundel. OpenIntro Statistics. 3rd 
ed. OpenIntro, Inc., 2015. 

An open source introductory statistics textbook. The entire book is freely available, 
as are the LaTeX files and R code used to compile the book and make the 
figures. 

Susan Holmes, Wolfgang Huber. Modern Statistics for Modern Biology. Cambridge 
University Press, 2018. 

A statistics text that emphasizes computational tools needed for modern biology. 
The entire book is freely available, and R code for all examples is provided. 

Chester Ismay, Albert Y. Kim. Modern Dive—An Introduction to Statistical and Data 
Sciences via R. https://moderndive.com. 

An online-only introductory textbook that teaches basic statistics and data science. 
The book covers both theoretical concepts and practical approaches using R. 

Historical Texts 

The books in this section are of interest primarily for historical reasons. They were 
influential at the time of their publication, but similar material can now be found 
elsewhere or in more modern form: 

William S. Cleveland. The Elements of Graphing Data. 2nd ed. Hobart Press, 1994. 

One of the first books about information design written for statisticians. The 
book contains many examples of scatterplots, line graphs, histograms, and boxplots, 
and it discusses them in the context of data analysis and statistical modeling. 
It also popularized the Cleveland dot plot. 

William S. Cleveland. Visualizing Data. Hobart Press, 1993. 

Companion book to The Elements of Graphing Data by the same author. This one 
is more mathematical and doesn’t talk about human perception. 

Edward R. Tufte. Envisioning Information. Graphics Press, 1990. 

This book popularized the concept of the small multiple. 

Edward R. Tufte. The Visual Display of Quantitative Information. 2nd ed. Graphics 
Press, 2001. 

First published in 1983, this book has been highly influential in the field of data 
visualization. It introduced concepts such as chart junk, data-ink ratio, and 
sparklines. The book also showed the first slopegraph (but didn’t name it). However, 
it does contain several recommendations that have not stood the test of 
time. In particular, it recommends an excessively minimalistic plot design. 


Books on Broadly Related Topics 

The following books are all broadly related to the topics of data visualization and 
effective communication: 

Joshua Schimel. Writing Science. Oxford University Press, 2011. 

Teaches how to write about scientific and other technical topics in an engaging 
way, by telling a story. While not primarily a book about data visualization, this is 
an indispensable text for anybody who needs to write technical articles and/or 
proposals. 

Jonathan Schwabish. Better Presentations. Columbia University Press, 2016. 

A short and informative guide for making presentations. A must-read for anybody 
who routinely uses slides to give talks or presentations. 

Maureen C. Stone. A Field Guide to Digital Color. A K Peters, 2003. 

A comprehensive guide to how colors are captured, processed, and reproduced 
by computers. 

Colin Ware. Information Visualization. 3rd ed. Morgan Kaufmann, 2012. 

A book about principles of visualization, specifically addressing topics such as 
how the human visual system works and how different graphical patterns are 
perceived. The book covers many different visualization scenarios, including user 
interfaces and virtual worlds, but it puts comparatively less emphasis on visualizing 
data in the form of 2D figures. 


Technical Notes 


The entire book was written in R Markdown, using the bookdown, rmarkdown, and 
knitr packages. All figures were made with ggplot2, with the help of several add-on 
packages including cowplot, geofacet, ggforce, ggmap, ggrepel, ggridges, hexbin, 
patchwork, sf, statebins, tidybayes, and treemapify. Color manipulations were 
done with the colorspace and colorblindr packages. For many of these packages, 
the current development version is required to compile all parts of the book. 

The source code for the book is available at https://github.com/clauswilke/dataviz. The 
book also requires a supporting R package, dviz.supp, whose code is available at 
https://github.com/clauswilke/dviz.supp. 

The book was last compiled using the following environment: 


## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/ ... /libRblas.0.dylib
## LAPACK: /Library/Frameworks/ ... /libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/ ... /C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] nycflights13_1.0.0   gapminder_0.3.0      RColorBrewer_1.1-2
##  [4] gganimate_1.0.0.9000 ungeviz_0.1.0        emmeans_1.3.1
##  [7] mgcv_1.8-24          nlme_3.1-137         broom_0.5.1
## [10] tidybayes_1.0.3      maps_3.3.0           statebins_2.0.0
## [13] sf_0.7-1             maptools_0.9-4       sp_1.3-1
## [16] rgeos_0.3-28         ggspatial_1.0.3      geofacet_0.1.9
## [19] plot3D_1.1.1         magick_1.9           hexbin_1.27.2
## [22] treemapify_2.5.0     gridExtra_2.3        ggmap_2.7.904
## [25] ggthemes_4.0.1       ggridges_0.5.1       ggrepel_0.8.0
## [28] ggforce_0.1.1        patchwork_0.0.1      lubridate_1.7.4
## [31] forcats_0.3.0        stringr_1.3.1        purrr_0.2.5
## [34] readr_1.1.1          tidyr_0.8.2          tibble_1.4.2
## [37] tidyverse_1.2.1      dviz.supp_0.1.0      dplyr_0.8.0.9000
## [40] colorblindr_0.1.0    ggplot2_3.1.0        colorspace_1.4-0
## [43] cowplot_0.9.99
##
## loaded via a namespace (and not attached):
##  [1] rjson_0.2.20        deldir_0.1-15
##  [3] class_7.3-14        rprojroot_1.3-2
##  [5] estimability_1.3    ggstance_0.3.1
##  [7] rstudioapi_0.7      farver_1.0.0.9999
##  [9] ggfittext_0.6.0     svUnit_0.7-12
## [11] mvtnorm_1.0-8       xml2_1.2.0
## [13] knitr_1.20          polyclip_1.9-1
## [15] jsonlite_1.5        png_0.1-7
## [17] compiler_3.5.0      httr_1.3.1
## [19] backports_1.1.2     assertthat_0.2.0
## [21] Matrix_1.2-14       lazyeval_0.2.1
## [23] cli_1.0.1.9000      tweenr_1.0.1
## [25] prettyunits_1.0.2   htmltools_0.3.6
## [27] tools_3.5.0         misc3d_0.8-4
## [29] coda_0.19-2         gtable_0.2.0
## [31] glue_1.3.0          Rcpp_1.0.0
## [33] cellranger_1.1.0    imguR_1.0.3
## [35] xfun_0.3            strapgod_0.0.0.9000
## [37] rvest_0.3.2         MASS_7.3-50
## [39] scales_1.0.0        hms_0.4.2
## [41] yaml_2.2.0          stringi_1.2.4
## [43] e1071_1.7-0         spData_0.2.9.4
## [45] RgoogleMaps_1.4.3   rlang_0.3.0.1
## [47] pkgconfig_2.0.2     bitops_1.0-6
## [49] geogrid_0.1.1       evaluate_0.11
## [51] lattice_0.20-35     tidyselect_0.2.5
## [53] plyr_1.8.4          magrittr_1.5
## [55] bookdown_0.7        R6_2.3.0
## [57] generics_0.0.2      DBI_1.0.0
## [59] pillar_1.3.0        haven_1.1.2
## [61] foreign_0.8-71      withr_2.1.2.9000
## [63] units_0.6-1         modelr_0.1.2
## [65] crayon_1.3.4        arrayhelpers_1.0-20160527
## [67] rmarkdown_1.10      progress_1.2.0.9000
## [69] jpeg_0.1-8          rnaturalearth_0.1.0
## [71] grid_3.5.0          readxl_1.1.0
## [73] digest_0.6.18       classInt_0.2-3
## [75] xtable_1.8-3        munsell_0.5.0
## [77] concaveman_1.0.0


Index 


Symbols 

2D bins, 41 

2D error bars, 43 

2D histograms, 222-225 

2D, projection of 3D objects into, 305, 309 

3D plots, 305-314 

appropriate use of, 313-314 
avoiding 3D position scales, 307-313 
avoiding gratuitous 3D, 305-307 
distortion from projecting objects into 2D, 
305 

problems with 3D bars, 306, 312 

A 

absolute number vs. percentages, 101 
accent color scales, 33 

Action-Background-Development-Climax- 
Ending story format, 334 
aesthetics, 7-12, 328 

applied to maps of geospatial data, 171 
commonly used in data visualizations, 7 
encoding single variable in multiple aesthetics, 
legends and, 253 
mapping data onto with scales, 10-12 
representing continuous or discrete data, 8 
types of data in visualizations, 8 
age pyramids, 68 
Albers projection, 167 
amounts, visualizing, 37, 45-58 
bar plots, 45-50 

grouped and stacked bars, 50-53 
dot plots and heat maps, 53-58 
on log scales, problems with, 212 
anticorrelated variables, 121 


associations, visualizing, 117-129 
correlograms, 121-124 
dimension reduction, 124-127 
paired data, 127-129 
scatterplots, 117-121 
automation, xiii 

programmatically generated figures, 327 
axes, 16-24 

axis ranges in small multiples, 257 
coordinate systems with curved axes, 22-24 
nonlinear, 16-22 
omitting in plots, 272 
titles, 270-272 

example plot with appropriate titles, 270 
omitting, 270 
overdoing, 272 

visualizing many distributions along the 
horizontal axis, 88-91 
visualizing many distributions along the 
vertical axis, 81-87 
axis labels, 270 

(see also axes, titles) 
larger, using, 291-295 

B 

background grids, 279, 282-287 

along both axes, in scatterplots, 287 
for figures about change in y axis values, 
284 

grid lines running perpendicular to key 
variable of interest, 286 
horizontal reference lines, 284 
in bar plots, 286 
in ggplot2 software, 282 
in scatterplots of paired data, using diagonal 
line with, 289 

recommendations on use, 290 
removing gray background and grid lines, 
284 

using diagonal line instead of in scatterplots 
of paired data, 287 

using gray background and grid lines, 282 
backgrounds 

recommendations on use, 290 
removing gray background and grid lines, 
284 

separating boxes in boxplots from, 302 
using gray background and grid lines, 282 
bad figures, 2 

balancing data and context, 277-290 
background grids, 282-287 
for figures with paired data, 287-290 
providing appropriate amount of context, 
277-282 

figure with too much non-data ink, 278 
figures with too little non-data ink, 281 
bandwidth, 61 
bar charts, 46 

(see also bar plots) 
bar plots 

background grid in, 286 
flawed approach to visualizing nested proportions, 
106 

limitations of stacked bars, 110 
line drawings in, 297 
omitting axes, 272 
on log scales, 212 

visualizing amounts, 212 
visualizing ratios, 214 
3D bars, problems with, 306, 311 
visualizing amounts, 45-50 
grouped bar plots, 50 
ordering of bars, 48 
problems with long labels for vertical 
bars, 46 

stacked bar plots, 52 
with horizontal bars, 48 
with vertical bars, 46 
visualizing proportions, 94 

case for side-by-side bars, 97-99 
case for stacked bars, 99-100 
vs. pie charts, 216 
with error bars, 193 


bars 

in visualizations along linear axes, 211 
on linear scales, 208 
visualizing amounts with, 37 
visualizing proportions with, 39 
Bayesian statistics, 187 

assessment of uncertainty, 194 
conceptual advantage of this approach, 196 
bibliography, 351-354 
binning data 

bins in histograms, 59-61 
choosing width of, 60 
into hexagonal bins, 224 
bitmap graphics, 319-320 

JPEG and PNG file formats, 319, 322 
lossless and lossy compression, 321-322 
saving vector graphics to, 324 
blue-yellow color-vision deficiency, 238, 244 
boxplots, 39 

shading for boxes, 302 
visualizing many distributions, 83-84 
bubble charts, 41 

visualizing associations among quantitative 
variables, 118 


C 

captions, 267-269 

figure title as first element in, 269 
incorporating into main display, 268 
location relative to display items in figures 
and tables, 275 
Cartesian coordinates, 13-16 
cartograms, 42, 161, 176-177 
heatmap, 177 
with distorted shapes, 176 
categorical data, 9 

(see also qualitative data) 
use of colors with, 27, 234 
choropleths, 30 

choropleth maps, 161, 172-176 
using in maps, 42 
color, 7 

as a tool to distinguish, 27-28 
as a tool to highlight, 33-34 
as representation of data values, 29-32 
common pitfalls of use, 233-242 

encoding too much or irrelevant information, 
233-237 
using nonmonotonic color scales to 
encode data values, 237-238 
encoding categorical variable by bar color, 
52 

fundamental uses in data visualizations, 27 
in photographic images, 322 
line drawings and, 297 
manipulations, packages used, 355 
mapping data values onto colors in heatmaps, 
56 

redundant coding, 243 

(see also redundant coding) 
using gray background for figures, 283 
using in visualizations of geospatial data, 42 
color scales, 12, 27-34 
accent, 33 
diverging, 30 
qualitative, 27 
sequential, 29 

color-vision deficiency, 238-242, 243-253 
deuteranomaly/deuteranopia, 238, 244 
protanomaly/protanopia, 238, 244 
tritanomaly/tritanopia, 238, 244 
ColorBrewer color scales, 27, 30 
compound figures, 255, 260-264 

alignment of individual figure panels, 264 
how individual panels fit together, 261 
labeling of individual panels, 260 
compression, lossless and lossy, of bitmap 
graphics, 321-322 
lossy compression in JPEGs, 324 
confidence bands, 44, 181, 197 
graded, 199 

confidence interval, 194 
confidence levels, 199 
confidence strips, 43 

as alternative to error bars, 193 
conformal projection, 163 
connected scatterplots, 42 

visualizing high-dimensional time series 
using PCA, 141 

visualizing time series with two or more 
response variables, 139 

consistency, achieving without being repetitive, 
345-349 

content and design, separation of, 330-331 
context in figures, 277 

(see also balancing data and context) 
continuous data, 8 


contour lines, 41, 225-231 

drawing each in its own panel, 230 
drawing multiple sets in different colors, 
228 

coordinate systems and axes, 13-24 
Cartesian, 13-16 

coordinate systems with curved axes, 22-24 
nonlinear axes, 16-22 
coordinates, 7 

correlated or anticorrelated variables, 121 
correlation coefficients 
calculating, 121 
plotting as correlograms, 41 
correlograms, 41 
drawback of, 124 

visualizing associations among quantitative 
variables, 121-124 
credible intervals, 194 
cumulative densities, 38 
cumulative distribution, 72 

(see also empirical cumulative distribution 
functions) 

curve fits, visualizing uncertainty of, 197-201 
CVD (see color-vision deficiency) 

D 

data 

categories of, 8 
in figures, 277 

(see also balancing data and context) 
too much data in figures, 338 
data exploration vs. data presentation, 327-330 
data source statements, 268 
data transformations (for 3D visualizations), 
309 

data values, using color to represent, 29-32, 42 
data visualization software 

autogeneration of legends, 248 
choosing, 325-331 

data exploration vs. data presentation, 
327-330 

reproducibility and repeatability, 
326-327 

separation of content and design, 
330-331 

smoothing features, 150 
thoughts on graphing software and figure 
preparation pipelines, xii 
data-ink ratio, 277 
dates and time, 9 
decomposition of time series, 157 
density plots 

direct labeling curves instead of using labels, 251 

line drawings in, 299 

using as legend replacement in scatterplots, 252 

visualizing a single distribution, 61-64 
choosing between density plots and histograms, 
64 

kernel density and bandwidth, 62 
visualizing distributions, 38 
visualizing multiple distributions, 66-68 
visualizing proportions 
faceted densities, 103 
stacked densities, 100 
design and content, separation of, 330-331 
deterministic construal errors, 190 
detrending and time-series decomposition, 
155-160 

deuteranomaly/deuteranopia, 238, 244 
dimension reduction, 124-127 

using with high-dimensional time series in 
connected scatterplot, 141 
direct area visualizations, 215-217 
pie charts, 215 
direct labeling 

as alternative to legends, 250 
as alternative to using colors, 234 
directory of visualizations, 37-44 
amounts, 37 
distributions, 38 
proportions, 39 
relationships, 41 
uncertainty, 43 
discrete data, 8 

discrete outcome visualization, 182 
distinguishing discrete items or groups, using 
color, 27-28 

distributions, visualizing, 38 

empirical cumulative distribution functions 
and q-q plots, 71-79 
highly skewed distributions, 74-78 
histograms and density plots, 59-68 

visualizing a single distribution, 59-64 
visualizing multiple densities simultaneously, 
64-68 

many distributions simultaneously, 81-91 


along the horizontal axis, 88-91 
along the vertical axis, 81-87 
diverging color scales, 30, 239 
example of, 32 

dose-response curves, visualizing with multiple 
time series, 135-138 
dot plots 

on log scales, 214 

points too small compared with text, 293 
quantile, 43 

visualizing a time series, 131 

connecting dots with a line, 132 
visualizing amounts, 37, 53-56 
dviz.supp (R package), 355 

E 

empirical cumulative distribution functions 
(ECDFs), 71-74 

ascending and descending ECDFs, 72 
highly skewed distributions, 74-78 
normalized ranks, 73 

Environmental Systems Research Institute 
(ESRI), 163 

equal-area projection, 163 
Albers projection, 167 
equator, 161 
error bars, 43, 82, 181 

advantage over other more complex displays 
of uncertainty, 193 
graded, 189 
in ridgeline plots, 197 
limitations of, 188 
variations, 191 
estimates, 186 
visualizing, 43 

(see also error bars; uncertainty, visualizing) 

European Petroleum Survey Group (EPSG), 163 

exploration of data (see data exploration vs. 
data presentation) 
eye, xi 

eye plots, 43, 197 

F 

faceting, 255 
factors, 8, 9, 11 
filling, 297 
fitted draws, 44 
frequencies, framing probabilities as, 181-185 
frequentist statistics, 187 

G 

Gaussian kernel, 61 

generalized additive models (GAMs), 150 
geographic regions 

showing how data values vary across, 30 
square-root scales for, 21 
geospatial data, 161-177 
showing in maps, 42 
visualizing in cartograms, 176-177 
visualizing in choropleth maps, 172-176 
visualizing with map projections, 161-168 
ggplot2, xii 

and add-on packages used in this book, 355 
background grids, 282 
facet_grid function, 255 
hue scale, 27 

separation of content and design via themes, 
330 

Goode homolosine projection, 165 
graded confidence bands, 44, 199 
graded error bars, 43, 189 

benefits and disadvantages of, 192 
grids, background (see background grids) 
grouped bar plots, 50 
grouping variables, 81 

H 

half-eyes, 43, 197 

hand-drawn or manually post-processed figures, 
329 
hatching, 297 
heatmaps, 38 
cartogram, 177 
visualizing amounts, 56-58 
hex bins, 41 

highlighting specific elements in data, using 
color, 33-34 

highly skewed distributions, 74-78 
histograms, 38, 183 
2D, 222-225 
in ridgeline plots, 89 
line drawings in, 297 
stacked, 39 

visualizing a single distribution, 60-61 
choosing between density plots and histograms, 
64 


visualizing multiple distributions with 
stacked histograms, 64-66 
historical texts, 353 

HOPs (see hypothetical outcome plots) 
human perception 

color-vision deficiency, 238 
judging distances better than areas, 216 
lossy compression and, 322 
more visually complex and unique figures 
are memorable, 343 
hypothetical outcome plots, 201-203 

I 

image file formats, 319-324 

bitmap and vector graphics, 319-321 
converting between formats, 324 
lossless and lossy compression of bitmap 
graphics, 321-322 

image manipulation or illustration programs, 
328 

incorporating figures into stories, 335 
intensity of colors, 236 
isotype plots, 345 

J 

jittering, 85 

using with overlapping points, 220 
JPEG files, 322 

saving to/from other formats, 324 

K 

kernel density estimation, 61, 183 
pitfall, 63 

knots (in splines), 150 

L 

labels 

axis, 270, 291-295 
(see also axes, titles) 
for vertical bars in bar plots, 46 
panels in compound figures, 261 
using direct labeling instead of colors for 
more than 8 categories, 234 
using direct labeling instead of legends, 250 
legends 

designing figures without, 250-253 
designing with redundant coding, 243-250 
in compound figures, 263 
labels too small, 291 
titles, 270-272 

example plot with appropriate titles, 270 
omitting, 270 
overdoing, 272 
levels (of a factor), 8, 11 
line drawings, avoiding, 297-302 
line graphs, 42 

uses other than visualizing time series, 137 
visualizing a time series, 132 
visualizing dose-response curve, 137 
visualizing multiple time series, 135 
visualizing time series with two or more 
response variables using two graphs, 138 
linear axes, visualizations along, 208-212 
bars, 208 
time series, 209 

linear relationships between variables, 152 
linear scales, 16 
ratios on, 19 

relationship to square root scales, 20 
linear-log plots, 155 
lines, width of, 7 

LOESS (locally estimated scatterplot smoothing), 
148 

seasonal decomposition of time series by, 
158 

log (logarithmic) scales, 17, 154 

datasets with numbers of very different 
magnitudes, 20 
log-linear plots, 155 
log-log plots, 155 
log-normal distributions, 75, 79 
logarithmic axes, visualizations along, 212-215 
bar graphs on log scales, 212 
lossless compression, 321 
lossy compression, 321 

in JPEG images, 322, 324 
key idea of, 322 

M 

manual steps in figure preparation, 328 
maps, 42, 161 

cartograms, 176-177 
choropleth, 172-176 

of multiple layers of different types of information, 
169 
projections, 161-168 
mean, 186 


memorable figures, creating, 343-345 
Mercator projection, 163 
mosaic plots, 40 
limitations of, 110 
omitting axes, 272 
treemaps vs., 109 

visualizing nested proportions, 107 
multipanel figures, 255-264 
compound figures, 260-264 
delineating plot panels with frames, 279 
drawing contour lines in separate panels, 
230 

small multiples, 255-260 
using small multiples instead of 3D plots, 
312 

N 

nested proportions, 105-115 

flawed approaches to visualizing, 105-107 
visualizing with mosaic plots and treemaps, 
107-110 

visualizing with nested pie charts, 111-112 
visualizing with parallel sets, 113-115 
nonlinear relationships in data, functional form 
for, 152 

nonmonotonic color scales, 237 
null hypothesis, 194 
numerical data, 9 

O 

Okabe Ito scale, 27, 241 

Opening-Challenge-Action-Resolution story 
format, 334 
ordered factors, 9, 11 
overlapping densities, 39 
overlapping points, handling, 219-231 
using 2D histograms, 222-225 
using contour lines, 225-231 
using partial transparency and jittering, 
219-222 

overplotting, 219 

P 

paired data, 41, 127-129 

balancing data and context in visualizations, 
287-290 

visualizing with scatterplot, 128 
visualizing with slopegraph, 129 
parallel sets, 40 

visualizing nested proportions, 113-115 
parameters, 186 

partial densities, visualizing proportions, 103 
partial transparency, using for overlapping 
dots, 220 
PDF files 

rendering time for large files, 321 
saving to/from other formats, 324 
percentages vs. absolute number, 101 
phase portraits, 139 

(see also connected scatterplots) 
pie charts 

bar plots vs., 216 
direct area visualizations, 215 
gratuitous use of 3D with, 305 
invalid approach to visualizing nested proportions, 
105 

nested pies, visualizing nested proportions, 
111-112 

no explicit axes, 272 
visualizing proportions, 39, 93-97 
emphasizing simple fractions, 96 
using bar plots instead of, 96 
plotting software, xii 

(see also data visualization software) 
focus of tutorials, xi 
PNG files, 321 

preferring over JPEG, 322 
saving to/from other formats, 324 
point estimates, 186-197 
points, 219 

(see also overlapping points, handling) 
solid points vs. open points, 302 
polar coordinates, 22 
poles, 161 
population, 186 
position, 7 
position scales, 11 

3D, appropriate use of, 313 
3D, avoiding, 307-313 
post processing draft figures, 328 
posterior, 194 
posterior distributions, 194 
visualizations of, 196 
power-law distributions, 74 
presentation of data (see data exploration vs. 
data presentation) 

principal component analysis (PCA), 125 


visualizing high-dimensional time series in 
PC space, 141 

principal components (PCs), 125 
principle of proportional ink, 207 

(see also proportional ink, principle of) 
prior, 194 

probability distributions, 183 
probability, framing probabilities as
frequencies, 181-185

programmatically generated figures (see
automation)

programming books for data visualization, 352 
projections (map), 161-168 

Albers projection, equal-area, 167 
conformal and equal-area, 163 
Goode homolosine, 165
Mercator projection and variants, 163 
registries maintained by standards bodies, 
163 

proportional ink, principle of, 207-217, 277 
definition of ink, 207 
direct area visualizations, 215-217 
visualizations along linear axes, 208-212 
visualizations along logarithmic axes, 
212-215 

proportions, visualizing, 39, 93-103 
case for pie charts, 93-97 
case for side-by-side bar plots, 97-99 
nested proportions, 105-115 

flawed approaches to visualizing, 
105-107 

visualizing with mosaic plots and tree- 
maps, 107-110 

visualizing with nested pie charts, 
111-112 

visualizing with parallel sets plot, 
113-115 

separately as parts of a total, 101 
case for stacked bars and stacked densities,
99-101 

protanomaly/protanopia, 238, 244 
protein data visualization, 313 

Q 

q-q plots, 38 

visualizing distributions, 78-79 
qualitative color scales, 27, 239 
how to use, example, 28 
Okabe Ito scale, 241 
qualitative data, 8
quantile dot plots, 43, 184 
quantile-quantile plots (see q-q plots) 
quantitative data, 8 

visualizing associations among quantitative 
variables, 117 

(see also associations, visualizing) 

R 

R language, xii 

R Markdown, 355 

rainbow scale, 237 

raster graphics (see bitmap graphics) 

ratios 

on log scale, 18 
visualizing on log scales, 212 
using bars, 214 
redundant coding, 243-253 

designing legends with, 243-250 
red-green color-vision deficiency, 238, 244 
references, 357-359 
related topics, books on, 354 
relationships (x-y), visualizing, 41 
repeatability, 326 

applied to data visualizations, 326 
difficulty in achieving working with plotting 
software, 327 

repetitiveness, avoiding while being consistent, 
345-349 

reproducibility, 326 

applied to data visualizations, 326 
difficulty in achieving working with plotting 
software, 327 
response variable, 81 
ridgeline plots, 39, 43 

showing Bayesian posterior distributions, 
196 

visualizing many distributions, 88-91 
rotated labels, 46 

S

sample, 186 
sample size, 186 
saturation of colors, 235, 244 
scale-free distributions, 74 
scales, 10-12 

color, 12, 27-34 
linear, 16 

log (logarithmic), 17 


position, 11 

scaling in small multiples, 257 
square-root, 20 
scatterplots 

background grids along both axes, 287 
connected, 42, 139 

drawing linear trend lines on top of points, 
152 

line drawings in, 301 
locally estimated scatterplot smoothing 
(LOESS), 148 

of paired data, using diagonal line instead of 
background grid, 287 
3D, and 3D position scales, 307 
3D, problems with, 311 
using direct labeling instead of legends, 251 
visualizations of large numbers of points, 41 
visualizing associations among quantitative 
variables, 117-121 
vs. dot plot of time series, 132 
visualizing paired data, 128 
with error bars, 193 

seasonal decomposition of time series by 
LOESS, 158 

sequential color scales, 29, 237 

for people with color-vision deficiency, 239 
shading 

for boxes in boxplots, 302 
in time series visualization along linear axes, 
209 

in visualizations along linear axes, 211 
shape, 7 
shape scales, 12 
significant differences, 190 
simplicity, disadvantage of, 343 
sina plots, 39, 87 
size, 7 

of colored graphical elements, 241 
size scales, 12 
slopegraphs, 129 
small multiples, 255-260 
building up toward, 341 
using instead of 3D visualizations, 312 
with too little context provided, 281 
smoothing, 145-150 

caution in interpreting results of smoothing 
functions, 150 

fuel-tank data represented with explicit
analytical model, 151
generalized additive models (GAMs), 150
locally estimated scatterplot smoothing 
(LOESS), 148 

provided by data visualization software, 150 
unpredictable behavior of general-purpose 
smoothers, 151 
using spline models, 150 
source code for this book, 355 
splines, 150 
square-root scales, 20 

appropriate applications of, 21 
problems with, 21 
stacked bar plots, 52 
limitations of, 110 

visualizing multiple sets of proportions, 40 
visualizing proportions, 94, 99 
stacked densities 

visualizing proportions, 100-101 
visualizing proportions changing along a 
continuous variable, 40 
stacked histograms, 39 

visualizing multiple distributions, 64-66 
standard deviation, 186 
vs. standard error, 187 
standard error, 186 
statistical sampling, 186 
statistics texts, 352 
stories 

role in reasoning and memory, 333 
telling, 333 

(see also telling a story and making a 
point) 

understanding what a story is, 334-337 
story arc, 334 
strip charts, 39 

visualizing many distributions, 85 

T


tables, 273-275 

caption location relative to display items, 
275 

example plots with, 273 
formatting in figures, 273 
horizontal or vertical lines, causing visual 
clutter, 275 
technical notes, 355 

telling a story and making a point, 333-349 
being consistent but not repetitive, 345-349 


building up toward complex figures, 
341-343 

making a figure for the generals, 337-340 
making your figures memorable, 343-345 
understanding what a story is, 334-337 
text, 9 

legible, balanced with rest of figure, 293 
perceived darkness of, 283 
substitutions in fonts, 320 
too small compared with rest of figure, 291 
themes, 330 

thinking about data and visualization, books 
on, 351 

3D (see under Symbols) 

TIFF files, 322 

time series, 131-143, 145 

(see also trends, visualizing) 
decomposition of, 157 
visualizations along linear axes, 209 
visualizing for x-y relationships, 42 
visualizing individual time series, 131-135 
visualizing multiple time series and dose- 
response curves, 135-138 
with two or more response variables,
visualizing, 138-143

titles 

axis and legend, 270-272 
figure (plot) titles and captions, 267-269 
topography, showing in 3D, 313 
transparency 

insufficiency for handling very large
numbers of points, 222

using partial transparency for overlapping 
dots, 220 

transverse Mercator projection, 164 
treemaps, 40 

human perception, problem with, 217 
limitations of, 110 
mosaic plots vs., 109 
no explicit axes, 272 
visualizing nested proportions, 108 
nested pies vs., 112 
trellis plots, 255 

(see also small multiples) 
trends, visualizing, 145-160 

detrending and time-series decomposition, 
155-160 

in x-y relationships, 42 
smoothing a time series, 145-150 
trends with a defined functional form,
151-155 

tritanomaly/tritanopia, 238, 244 
2D (see under Symbols) 

U

ugly, bad, and wrong figures, 2 
uncertainty, visualizing, 43, 181-203 
for curve fits, 197-201 
for point estimates, 186-197 
framing probabilities as frequencies, 181-185

hypothetical outcome plots, 201-203 
unordered factors, 11 
USA, Albers projection map of, 167 

V 

vector graphics, 319-321 

differences in appearance in different
graphics programs, 320


enormous file sizes for large/complex
figures, 321

PDF, EPS, and SVG image file formats, 319 
saving to bitmap, 324 
violin plots, 39, 43 

visualizing many distributions, 84-85 
with error bars, 197 

vision, color-vision deficiency, 238-242, 244 
simulation of legends for, 249 
visual language, consistent, 261 
visualization software (see data visualization 
software) 

W

web Mercator projection, 164 
wrong figures, 2 

X 

x-y relationships, visualizations of, 41 
Preface


You should look at your data. Graphs and charts let you explore 
and learn about the structure of the information you collect. Good 
data visualizations also make it easier to communicate your ideas 
and findings to other people. Beyond that, producing effective 
plots from your own data is the best way to develop a good eye 
for reading and understanding graphs—good and bad—made by 
others, whether presented in research articles, business slide decks, 
public policy advocacy, or media reports. This book teaches you 
how to do it. 

My main goal is to introduce you to both the ideas and the 
methods of data visualization in a sensible, comprehensible,
reproducible way. Some classic works on visualizing data, such as The
Visual Display of Quantitative Information (Tufte 1983), present 
numerous examples of good and bad work together with some 
general taste-based rules of thumb for constructing and assessing
graphs. In what has now become a large and thriving field of
research, more recent work provides excellent discussions of the 
cognitive underpinnings of successful and unsuccessful graphics, 
again providing many compelling and illuminating examples 
(Ware 2008). Other books provide good advice about how to graph 
data under different circumstances (Cairo 2013; Few 2009;
Munzner 2014) but choose not to teach the reader about the tools used
to produce the graphics they show. This may be because the software
used is some (proprietary, costly) point-and-click application
that requires a fully visual introduction of its own, such as Tableau, 
Microsoft Excel, or SPSS. Or perhaps the necessary software is 
freely available, but showing how to use it is not what the book 
is about (Cleveland 1994). Conversely, there are excellent cookbooks
that provide code “recipes” for many kinds of plot (Chang
2013). But for that reason they do not take the time to introduce the
beginner to the principles behind the output they produce. Finally,
we also have thorough introductions to particular software tools
and libraries, including the ones we will use in this book (Wickham
2016). These can sometimes be hard for beginners to digest, as they 
may presuppose a background that the reader does not have. 

Each of the books I have just cited is well worth your time. 
When teaching people how to make graphics with data, however, 
I have repeatedly found the need for an introduction that motivates 
and explains why you are doing something but that does not skip 
the necessary details of how to produce the images you see on the 
page. And so this book has two main aims. First, I want you to get 
to the point where you can reproduce almost every figure in the 
text for yourself. Second, I want you to understand why the code 
is written the way it is, such that when you look at data of your 
own you can feel confident about your ability to get from a rough 
picture in your head to a high-quality graphic on your screen or 
page. 


What You Will Learn 

This book is a hands-on introduction to the principles and practice
of looking at and presenting data using R and ggplot. R is a
powerful, widely used, and freely available programming language 
for data analysis. You may be interested in exploring ggplot after 
having used R before or be entirely new to both R and ggplot and 
just want to graph your data. I do not assume you have any prior 
knowledge of R. 

After installing the software we need, we begin with an overview
of some basic principles of visualization. We focus not just on the
aesthetic aspects of good plots but on how their effectiveness is rooted
in the way we perceive properties like length, absolute and relative 
size, orientation, shape, and color. We then learn how to produce 
and refine plots using ggplot2, a powerful, versatile, and widely 
used visualization package for R (Wickham 2016). The ggplot2 
library implements a “grammar of graphics” (Wilkinson 2005). 
This approach gives us a coherent way to produce visualizations 
by expressing relationships between the attributes of data and their 
graphical representation. 
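
As a rough preview of what that grammar looks like in practice, the sketch below maps two columns of a table to the x and y positions of a plot and then adds layers that say how to draw them. It assumes ggplot2 is already installed (installation is described under “Before You Begin”) and uses the mtcars data that ships with R. The choice of variables and the object names are for illustration only and are not taken from the book.

```r
library(ggplot2)

# Map columns of the data to visual properties (aesthetics) ...
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# ... then add layers that say how those mappings should be drawn.
p_points <- p + geom_point()
p_trend  <- p_points + geom_smooth(method = "lm", formula = y ~ x)

# Printing the object draws the plot.
p_trend
```

Each `+` adds one layer to the same underlying mapping, which is why a plot can be built up a piece at a time.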

Through a series of worked examples, you will learn how to 
build plots piece by piece, beginning with scatterplots and summaries
of single variables, then moving on to more complex graphics.
Topics covered include plotting continuous and categorical
variables; layering information on graphics; faceting grouped data
to produce effective “small multiple” plots; transforming data to 
easily produce visual summaries on the graph such as trend lines, 
linear fits, error ranges, and boxplots; creating maps; and some 
alternatives to maps worth considering when presenting country- 
or state-level data. We will also cover cases where we are not 
working directly with a dataset but rather with estimates from a 
statistical model. From there, we will explore the process of refining
plots to accomplish common tasks such as highlighting key
features of the data, labeling particular items of interest, annotating
plots, and changing their overall appearance. Finally we
will examine some strategies for presenting graphical results in 
different formats and to different sorts of audiences. 

If you follow the text and examples in this book, then by the 
end you will 

• understand the basic principles behind effective data visualization;

• have a practical sense for why some graphs and figures work
well, while others may fail to inform or actively mislead;

• know how to create a wide range of plots in R using ggplot2;
and

• know how to refine plots for effective presentation.

Learning how to visualize data effectively is more than just 
knowing how to write code that produces figures from data. This 
book will teach you how to do that. But it will also teach you 
how to think about the information you want to show, and how to 
consider the audience you are showing it to—including the most 
common case, when the audience is yourself. 

This book is not a comprehensive guide to R, or even a comprehensive
survey of everything ggplot can do. Nor is it a cookbook
containing just examples of specific things people commonly want 
to do with ggplot. (Both these sorts of books already exist: see the 
references in the appendix.) Neither is it a rigid set of rules, or a 
sequence of beautiful finished examples that you can admire but 
not reproduce. My goal is to get you quickly up and running in R, 
making plots in a well-informed way, with a solid grasp of the core 
sequence of steps—taking your data, specifying the relationship 
between variables and visible elements, and building up images 
layer by layer—that is at the heart of what ggplot does. 



Learning ggplot does mean getting used to how R works, and 
also understanding how ggplot connects to other tools in the R 
language. As you work your way through the book, you will gradually
learn more about some very useful idioms, functions, and
techniques for manipulating data in R. In particular you will learn 
about some of the tools provided by the tidyverse library that 
ggplot belongs to. Similarly, although this is not a cookbook, once 
you get past chapter 1 you will be able to see and understand the 
code used to produce almost every figure in the book. In most 
cases you will also see these figures built up piece by piece, a step 
at a time. If you use the book as it is designed, by the end you will 
have the makings of a version of the book itself, containing code 
you have written out and annotated yourself. And though we do 
not go into great depth on the topic of rules or principles of visualization,
the discussion in chapter 1 and its application throughout
the book gives you more to think about than just a list of graph 
types. By the end of the book you should be able to look at a figure
and see it in terms of ggplot’s grammar, understanding
how the various layers, shapes, and data are pieced together to
make a finished plot. 


The Right Frame of Mind 

It can be a little disorienting to learn a programming language like 
R, mostly because at the beginning there seem to be so many pieces 
to fit together in order for things to work properly. It can seem like 
you have to learn everything before you can do anything. The language
has some possibly unfamiliar concepts that define how it
works, like “object,” “function,” or “class.” The syntactic rules for 
writing code are annoyingly picky. Error messages seem obscure; 
help pages are terse; other people seem to have had not quite the 
same issue as you. Beyond that, you sense that doing one thing 
often involves learning a bit about some other part of the language. 
To make a plot you need a table of data, but maybe you need to 
filter out some rows, recalculate some columns, or just get the computer
to see it is there in the first place. And there is also a wider
environment of supporting applications and tools that are good to 
know about but involve new concepts of their own—editors that 
highlight what you write; applications that help you organize your 
code and its output; ways of writing your code that let you keep
track of what you have done. It can all seem a bit confusing. 

Don’t panic. You have to start somewhere. Starting with graphics
is more rewarding than some of the other places you might
begin, because you will be able to see the results of your efforts very 
quickly. As you build your confidence and ability in this area, you 
will gradually see the other tools as things that help you sort out 
some issue or solve a problem that’s stopping you from making the 
picture you want. That makes them easier to learn. As you acquire 
them piecemeal—perhaps initially using them without completely 
understanding what is happening—you will begin to see how they 
fit together and be more confident of your own ability to do what 
you need to do. 

Even better, in the past decade or so the world of data analysis
and programming generally has opened up in a way that has
made help much easier to come by. Free tools for coding have been 
around for a long time, but in recent years what we might call the 
“ecology of assistance” has gotten better. There are more resources 
available for learning the various pieces, and more of them are oriented
to the way writing code actually happens most of the time—
which is to say, iteratively, in an error-prone fashion, and taking 
account of problems other people have run into and solved before. 


How to Use This Book 

This book can be used in any one of several ways. At a minimum, 
you can sit down and read it for a general overview of good practices
in data visualization, together with many worked examples of
graphics from their beginnings to a properly finished state. Even 
if you do not work through the code, you will get a good sense of 
how to think about visualization and a better understanding of the 
process through which good graphics are produced. 

More useful, if you set things up as described in chapter 2 and 
then work through the examples, you will end up with a data visualization
book of your own. If you approach the book this way,
then by the end you will be comfortable using ggplot in particular 
and also be ready to learn more about the R language in general. 

You can also bring your own data to
explore instead of or alongside the
examples, as described in chapter 2.

This book can also be used to teach with, either as the main
focus of a course on data visualization or as a supplement to
undergraduate or graduate courses in statistics or data analysis. My
aim has been to make the “hidden tasks” of coding and polishing 
graphs more accessible and explicit. I want to make sure you are 
not left with the “How to Draw an Owl in Three Steps” problem 
common to many tutorials. You know the one. The first two steps 
are shown clearly enough. Sketch a few bird-shaped ovals. Make 
a line for a branch. But the final step, an owl such as John James 
Audubon might have drawn, is presented as a simple extension for 
readers to figure out for themselves. 

If you have never used R or ggplot, you should start at the 
beginning of the book and work your way through to the end. 
If you know about R already and only want to learn the core of 
ggplot, then after installing the software described below, focus on 
chapters 3 through 5. Chapter 6 (on models) necessarily incorporates
some material on statistical modeling that the book cannot
develop fully. This is not a statistics text. So, for example, I show 
generally how to fit and work with various kinds of model in chapter 6,
but I do not go through the important details of fitting,
selecting, and fully understanding different approaches. I provide 
references in the text to other books that have this material as their 
main focus. 

Each chapter ends with a section suggesting where to go next 
(apart from continuing to read the book). Sometimes I suggest 
other books or websites to explore. I also ask questions or pose 
some challenges that extend the material covered in the chapter, 
encouraging you to use the concepts and skills you have learned. 

Conventions 

In this book we alternate between regular text (like this), samples 
of code that you can type and run yourself, and the output of that 
code. In the main text, references to objects or other things that 
exist in the R language or in your R project—tables of data, variables,
functions, and so on—will also appear in a monospaced or
“typewriter” typeface. Code you can type directly into R at the 
console will be in gray boxes and also monospaced, like this: 


my.numbers <- c(1, 1, 4, 1, 1, 4, 1)


Additional notes and information will sometimes 
appear in the margin, like this. 


If you type that line of code into R’s console, it will create a 
thing called my.numbers. Doing this doesn’t produce any output,
however. When we write code that also produces output at the console,
we will first see the code (in a gray box) and then the output
in a monospaced font against a white background. Here we add 
two numbers and see the result: 

4 + 1 

## [1] 5 

Two further notes about how to read this. First, by default in 
this book, anything that comes back to us at the console as the 
result of typing a command will be shown prefaced by two hash 
characters (##) at the beginning of each line of output. This is to
help distinguish it from commands we type into the console. You 
will not see the hash characters at the console when you use R. 

Second, both in the book and at the console, if the output of 
what you did results in a series of elements (numbers, observations
from a variable, and so on), you will often see output that
includes some number in square brackets at the beginning of the
line. It looks like this: [1]. This is not part of the output itself
but just a counter or index keeping track of how many items have
been printed out so far. In the case of adding 4 + 1 we got just
one, or [1], thing back—the number five. If there are more elements
returned as the result of some instruction or command, the
counter will keep track of that on each line. In this next bit of code 
we will tell R to show us the lowercase letters of the alphabet: 


letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## [11] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
## [21] "u" "v" "w" "x" "y" "z"

You can see the counter incrementing on each line as it keeps 
count of how many letters have been printed. 
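
To see these conventions at work on something other than letters, here is a small sketch you could type at the console. It reuses the my.numbers object from above. The name long_result is made up for this illustration.

```r
my.numbers <- c(1, 1, 4, 1, 1, 4, 1)

# Typing an object's name at the console prints its contents.
my.numbers
## [1] 1 1 4 1 1 4 1

# A longer result wraps over several lines, and each line starts with
# the index of its first element. Where the lines break depends on
# the width of your console, so no output is shown here.
long_result <- seq(from = 10, to = 400, by = 10)
long_result
```

Assigning with <- stores the values silently. The values appear only when you ask for the object by name.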


Before You Begin 

The book is designed for you to follow along in an active way, writ¬ 
ing out the examples and experimenting with the code as you go. 
You will be able to reproduce almost all the plots in the text. You 
need to install some software first. Here is what to do: 
cloud.r-project.org


rstudio.com 


tidyverse.org 


1. Get the most recent version of R. It is free and available for 
Windows, Mac, and Linux operating systems. Download the 
version of R compatible with your operating system. If you are 
running Windows or MacOS, choose one of the precompiled 
binary distributions (i.e., ready-to-run applications) linked at 
the top of the R Project’s web page. 

2. Once R is installed, download and install R Studio, which is an 
“Integrated Development Environment,” or IDE. This means 
it is a front-end for R that makes it much easier to work with. R 
Studio is also free and available for Windows, Mac, and Linux 
platforms. 

3. Install the tidyverse and several other add-on packages for R. 
These packages provide useful functionality that we will take 
advantage of throughout the book. You can learn more about 
the tidyverse’s family of packages at its website. 


I strongly recommend typing all the code examples 
right from the beginning, instead of copying and
pasting. 


To install the tidyverse, make sure you have an internet connection
and then launch R Studio. Type the following lines of code
at R’s command prompt, located in the window named “Console,”
and hit return. In the code below, the <- arrow is made up of two
keystrokes, first < and then the short dash or minus symbol, -.


my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
                 "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
                 "here", "interplot", "margins", "maps", "mapproj",
                 "mapdata", "MASS", "quantreg", "rlang", "scales",
                 "survey", "srvyr", "viridis", "viridisLite", "devtools")

install.packages(my_packages, repos = "http://cran.rstudio.com") 


github.com 

GitHub is a web-based service where users can host, 
develop, and share code. It uses git, a version control 
system that allows projects, or repositories, to 
preserve their history and incorporate changes from 
contributors in an organized way. 


R Studio should then download and install these packages for 
you. It may take a little while to download everything. 
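
If you want to confirm that the installation worked, one possible check is sketched below. It is not part of the book's instructions. The vector wanted is a shortened, hypothetical stand-in for the full list above, and requireNamespace() is a base R function that reports whether a package can be loaded.

```r
# A shortened, illustrative list; substitute the full set of names if you like.
wanted <- c("tidyverse", "broom", "gapminder", "ggrepel")

# TRUE for each package that R can find and load, FALSE otherwise.
is_available <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed.
still_missing <- wanted[!is_available]
still_missing
```

An empty result means everything in the list is available. Any names that do appear can be passed to install.packages() again.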

With these packages available, you can then install one last 
library of material that’s useful specifically for this book. It is 
hosted on GitHub, rather than R’s central package repository, so 
we use a different function to fetch it. 

devtools::install_github("kjhealy/socviz") 


Once you’ve done that, we can get started. 



Data Visualization 




1 Look at Data 


Some data visualizations are better than others. This chapter discusses
why that is. While it is tempting to simply start laying down
the law about what works and what doesn’t, the process of making a 
really good or really useful graph cannot be boiled down to a list of 
simple rules to be followed without exception in all circumstances. 
The graphs you make are meant to be looked at by someone. The 
effectiveness of any particular graph is not just a matter of how 
it looks in the abstract but also a question of who is looking at 
it, and why. An image intended for an audience of experts reading
a professional journal may not be readily interpretable by the
general public. A quick visualization of a dataset you are currently 
exploring might not be of much use to your peers or your students. 

Some graphs work well because they depend in part on some 
strong aesthetic judgments about what will be effective. That sort 
of good judgment is hard to systematize. However, data visualization
is not simply a matter of competing standards of good taste.
Some approaches work better for reasons that have less to do with 
one’s sense of what looks good and more to do with how human 
visual perception works. When starting out, it is easier to grasp 
these perceptual aspects of data visualization than it is to get a reliable,
taste-based feel for what works. For this reason, it is better
to begin by thinking about the relationship between the structure 
of your data and the perceptual features of your graphics. Getting 
into that habit will carry you a long way toward developing the 
ability to make good taste-based judgments, too. 

As we shall see later on, when working with data in R and 
ggplot, we get many visualization virtues for free. In general, the 
default layout and look of ggplot’s graphics is well chosen. This 
makes it easier to do the right thing. It also means that, if you really 
just want to learn how to make some plots right this minute, you 
could skip this chapter altogether and go straight to the next one. 
But although we will not be writing any code for the next few pages, 
we will be discussing aspects of graph construction, perception, 
and interpretation that matter for code you will choose to write. So 
I urge you to stick around and follow the argument of this chapter.
When making graphs there is only so much that your software can 
do to keep you on the right track. It cannot force you to be honest 
with yourself, your data, and your audience. The tools you use can 
help you live up to the right standards. But they cannot make you 
do the right thing. This means it makes sense to begin cultivating 
your own good sense about graphs right away. 

We will begin by asking why we should bother to look at pictures
of data in the first place, instead of relying on tables or numerical
summaries. Then we will discuss a few examples, first of bad
visualization practice, and then more positively of work that looks 
(and is) much better. We will examine the usefulness and limits of 
general rules of thumb in visualization and show how even tasteful,
well-constructed graphics can mislead us. From there we will
briefly examine some of what we know about the perception of 
shapes, colors, and relationships between objects. The core point 
here is that we are quite literally able to see some things much more 
easily than others. These cognitive aspects of data visualization 
make some kinds of graphs reliably harder for people to interpret. 
Cognition and perception are relevant in other ways, too. We tend 
to make inferences about relationships between the objects that 
we see in ways that bear on our interpretation of graphical data, 
for example. Arrangements of points and lines on a page can 
encourage us—sometimes quite unconsciously—to make inferences
about similarities, clustering, distinctions, and causal relationships
that might or might not be there in the numbers. Sometimes
these perceptual tendencies can be honestly harnessed to
make our graphics more effective. At other times, they will tend to 
lead us astray, and we must take care not to lean on them too much. 

In short, good visualization methods offer extremely valuable 
tools that we should use in the process of exploring, understanding,
and explaining data. But they are not a magical means of
seeing the world as it really is. They will not stop you from trying
to fool other people if that is what you want to do, and they
may not stop you from fooling yourself either. 


1.1 Why Look at Data? 

Figure 1.1: Plots of Anscombe's quartet.

Anscombe’s quartet (Anscombe 1973; Chatterjee & Firat 2007),
shown in figure 1.1, presents its argument for looking at data in
visual form. It uses a series of four scatterplots. A scatterplot shows
the relationship between two quantities, such as height and weight,
age and income, or time and unemployment. Scatterplots are the 
workhorse of data visualization in social science, and we will be 
looking at a lot of them. The data for Anscombe’s plots comes 
bundled with R. You can look at it by typing anscombe at the command
prompt. Each of the four made-up “datasets” contains eleven
observations of two variables, x and y. By construction, the numerical
properties of each pair of x and y variables, such as their means,
are almost identical. Moreover, the standard measures of the association
between each x and y pair also match. The correlation
coefficient is a strong 0.81 in every case. But when the datasets 
are visualized as a scatterplot, with the x variables plotted on the 
horizontal axis and the y variables on the vertical, the differences 
between them are readily apparent. 
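
The near-identical summary statistics can be checked directly against the anscombe data frame that ships with R. What follows is a minimal sketch in base R; the helper name summarize_pair is introduced here purely for illustration and is not part of any package.

```r
# `anscombe` is bundled with R; its columns are x1..x4 and y1..y4.
# summarize_pair() is an illustrative helper, not a function from any package.
summarize_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(dataset = i, mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
}

# One row of summary statistics per x-y pair
quartet_stats <- as.data.frame(t(sapply(1:4, summarize_pair)))
print(round(quartet_stats, 3))
```

The correlation comes out at roughly 0.816 for each pair, and the means agree to within rounding. Drawing any one pair, for instance with plot(anscombe$x2, anscombe$y2), is what reveals how different the four datasets are.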

Anscombe’s quartet is an extreme, manufactured example. But 
the benefits of visualizing one’s data can be shown in real cases. 
Figure 1.2 shows a graph from Jackman (1980), a short comment 
on Hewitt (1977). The original paper had argued for a significant 
association between voter turnout and income inequality based on 
a quantitative analysis of eighteen countries. When this relationship
was graphed as a scatterplot, however, it immediately became
clear that the quantitative association depended entirely on the 
inclusion of South Africa in the sample. 
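
The same mechanism is easy to reproduce with made-up numbers. The sketch below does not use Hewitt's or Jackman's data; the seventeen baseline points and the single far-off point are invented so that the association is zero by construction until the outlier is added.

```r
# Invented data: 17 points whose y values are symmetric around the middle x,
# so x and y are uncorrelated by construction.
x <- 1:17
y <- c(5, 3, 6, 4, 7, 3, 5, 6, 4, 6, 5, 3, 7, 4, 6, 3, 5)

# One additional, far-off observation (also invented)
x_all <- c(x, 30)
y_all <- c(y, 20)

# Correlation without and with the extra point
r_without <- cor(x, y)
r_with <- cor(x_all, y_all)

# Slope of the bivariate regression line, without and with the extra point
slope_without <- unname(coef(lm(y ~ x))[2])
slope_with <- unname(coef(lm(y_all ~ x_all))[2])

print(round(c(r_without = r_without, r_with = r_with), 2))
print(round(c(slope_without = slope_without, slope_with = slope_with), 2))
```

A single observation is enough to turn a flat line into a clearly positive slope. A scatterplot of x_all against y_all makes the source of the association obvious at a glance.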

An exercise by Jan Vanhove (2016) demonstrates the usefulness
of looking at model fits and data at the same time. Figure 1.3
presents an array of scatterplots. As with Anscombe’s quartet, each 
panel shows the association between two variables. Within each 
panel, the correlation between the x and y variables is set to be 
0.6, a pretty good degree of association. But the actual distribution 
of points is created by a different process in each case. In the top 
left panel each variable is normally distributed around its mean 
value. In other panels there is a single outlying point far off in one 
direction or another. Others are produced by more subtle rules. 
But each gives rise to the same basic linear association. 

Illustrations like this demonstrate why it is worth looking at
data. But that does not mean that looking at data is all one needs
to do. Real datasets are messy, and while displaying them graphically
is very useful, doing so presents problems of its own. As
we will see below, there is considerable debate about what sort of
visual work is most effective, when it can be superfluous, and how
it can at times be misleading to researchers and audiences alike.
Just like seemingly sober and authoritative tables of numbers, data
visualizations have their own rhetoric of plausibility. Anscombe’s
quartet notwithstanding, and especially for large volumes of data,
summary statistics and model estimates should be thought of as
tools that we use to deliberately simplify things in a way that lets
us see past a cloud of data points shown in a figure. We will not
automatically get the right answer to our questions just by looking.

Correlations can run from -1 to 1, with zero meaning
there is no association. A score of -1 means a perfect
negative association and a score of 1 a perfect
positive association between the two variables.
So 0.81 counts as a strong positive correlation.

Figure 1.2: Seeing the effect of an outlier on a
regression line. Key: bivariate slope including South Africa (N = 18);
bivariate slope excluding South Africa (N = 17).

A more careful quantitative approach could have
found this issue as well, for example, with a proper
sensitivity analysis. But the graphic makes the case
directly.

Figure 1.3: What data patterns can lie behind a correlation? The correlation
coefficient in all these plots is 0.6. Figure adapted from code by Jan
Vanhove.


1.2 What Makes Bad Figures Bad? 

It is traditional to begin discussions of data visualization with a 
“parade of horribles,” in an effort to motivate good behavior later. 
However, these negative examples often combine several kinds of 
badness that are better kept separate. For convenience, we can 
say that our problems tend to come in three varieties. Some are 
strictly aesthetic. The graph we are looking at is in some way 
tacky, tasteless, or a hodgepodge of ugly or inconsistent design 
choices. Some are substantive. Here, our graph has problems that 
are due to the data being presented. Good taste might make 
things look better, but what we really need is to make better 
use of the data we have, or get new information and plot that 
instead. And some problems are perceptual. In these cases, even 
with good aesthetic qualities and good data, the graph will be 
confusing or misleading because of how people perceive and process
what they are looking at. It is important to understand that
these elements, while often found together, are distinct from one 
another. 


Bad taste 

Let’s start with the bad taste. The chart in figure 1.4 is tasteless
and has far too much going on in it, given the modest amount
of information it displays. The bars are hard to read and compare.
It needlessly duplicates labels and makes pointless use of
three-dimensional effects, drop shadows, and other unnecessary 
design features. 

The best-known critic by far of this style of visualization, and
the best-known taste-maker in the field, is Edward R. Tufte. His
book The Visual Display of Quantitative Information (1983) is a
classic, and its sequels are also widely read (Tufte 1990, 1997). The
bulk of this work is a series of examples of good and bad visualization,
along with some articulation of more general principles (or
rules of thumb) extracted from them. It is more like a reference
book about completed dishes than a cookbook for daily use in the
kitchen. At the same time, Tufte’s early academic work in political
science shows that he effectively applied his own ideas to research
questions. His Political Control of the Economy (1978) combines
tables, figures, and text in a manner that remains remarkably fresh
almost forty years later.

Figure 1.4: A chart with a considerable amount of
junk in it.

Tufte’s message is sometimes frustrating, but it is consistent: 

Graphical excellence is the well-designed presentation of interesting
data—a matter of substance, of statistics, and of design....
[It] consists of complex ideas communicated with clarity, precision,
and efficiency.... [It] is that which gives to the viewer the
greatest number of ideas in the shortest time with the least ink
in the smallest space.... [It] is nearly always multivariate.... And
graphical excellence requires telling the truth about the data. (Tufte
1983, 51)


Tufte illustrates the point with Charles Joseph Minard’s famous
visualization of Napoleon’s march on Moscow, shown here in
figure 1.5. He remarks that this image “may well be the best
statistical graphic ever drawn” and argues that it “tells a rich, coherent
story with its multivariate data, far more enlightening than just
a single number bouncing along over time. Six variables are plotted:
the size of the army, its location on a two-dimensional surface,
direction of the army’s movement, and temperature on various
dates during the retreat from Moscow.”

Figure 1.5: Minard's visualization of Napoleon's
retreat from Moscow. Justifiably cited as a classic, it
is also atypical and hard to emulate in its specifics.

It is worth noting how far removed Minard’s image is from
most contemporary statistical graphics. At least until recently,
these have tended to be applications or generalizations of scatterplots
and bar plots, in the direction of either seeing more raw data
or seeing the output derived from a statistical model. The former
looks for ways to increase the volume of data visible, or the number
of variables displayed within a panel, or the number of panels displayed
within a plot. The latter looks for ways to see results such
as point estimates, confidence intervals, and predicted probabilities
in an easily comprehensible way. Tufte acknowledges that a
tour de force such as Minard’s “can be described and admired, but
there are no compositional principles on how to create that one
wonderful graphic in a million.” The best one can do for “more
routine, workaday designs” is to suggest some guidelines such as
“have a properly chosen format and design,” “use words, numbers,
and drawing together,” “display an accessible complexity of
detail,” and “avoid content-free decoration, including chartjunk”
(Tufte 1983, 177).

In practice those compositional principles have amounted to
an encouragement to maximize the “data-to-ink” ratio. This is
practical advice. It is not hard to jettison tasteless junk, and if
we look a little harder we may find that the chart can do without
other visual scaffolding as well. We can often clean up the
typeface, remove extraneous colors and backgrounds, and simplify,
mute, or delete gridlines, superfluous axis marks, or needless
keys and legends. Given all that, we might think that a solid
rule of “simplify, simplify” is almost all of what we need to make
sure that our charts remain junk-free and thus effective. Unfortunately
this is not the case. For one thing, somewhat annoyingly,
there is evidence that highly embellished charts like Nigel
Holmes’s “Monstrous Costs” (fig. 1.6) are often more easily recalled
than their plainer alternatives (Bateman et al. 2010). Viewers do
not find them more easily interpretable, but they do remember
them more easily and also seem to find them more enjoyable to
look at. They also associate them more directly with value judgments,
as opposed to just trying to get information across. Borkin
et al. (2013) also found that visually unique, “infographic”-style
graphs were more memorable than more standard statistical visualizations.
(“It appears that novel and unexpected visualizations
can be better remembered than the visualizations with limited
variability that we are exposed to since elementary school,” they
remark.)

Figure 1.6: “Monstrous Costs” by Nigel Holmes
(1982). Also a classic of its kind.

Figure 1.7: Six kinds of summary boxplots, labeled (a) through (f). Type (c) is
from Tufte.

Even worse, it may be the case that graphics that really do maximize
the data-to-ink ratio are harder to interpret than those that
are a little more relaxed about it. E. W. Anderson et al. (2011) found 
that, of the six kinds of boxplot shown in figure 1.7, the minimalist 
version from Tufte’s own work (option C) proved to be the most 
cognitively difficult for viewers to interpret. Cues like labels and 
gridlines, together with some strictly superfluous embellishment 
of data points or other design elements, may often be an aid rather 
than an impediment to interpretation. 

While chartjunk is not entirely devoid of merit, bear in mind 
that ease of recall is only one virtue among many for graphics. 
It is also the case that, almost by definition, it is no easier to 
systematize the construction of a chart like “Monstrous Costs” 
than it is to replicate the impact of Minard’s graph of Napoleon’s
retreat. Indeed, the literature on chartjunk suggests that the two
may have some qualities in common. To be sure, Minard’s figure
is admirably rich in data while Holmes’s is not. But both are visually
distinctive in a way that makes them memorable, both show
a substantial amount of bespoke design, and both are unlike most 
of the statistical graphs you will see or make. 


Bad data

In your everyday work you will be in little danger of producing
either a “Monstrous Costs” or a “Napoleon’s Retreat.” You are
much more likely to make a good-looking, well-designed figure
that misleads people because you have used it to display some bad
data. Well-designed figures with little or no junk in their component
parts are not by themselves a defense against cherry-picking
your data or presenting information in a misleading way. Indeed,
it is even possible that, in a world where people are on guard
against junky infographics, the “halo effect” accompanying a well-produced
figure might make it easier to mislead some audiences.
Or, perhaps more common, good aesthetics does not make it much 
harder for you to mislead yourself as you look at your data. 

In November 2016 the New York Times reported on some 
research on people’s confidence in the institutions of democracy. 
It had been published in an academic journal by the political scientists
Yascha Mounk and Roberto Stefan Foa. The headline in
the Times ran “How Stable Are Democracies? ‘Warning Signs Are
Flashing Red’” (Taub 2016). The graph accompanying the article,
reproduced in figure 1.8, certainly seemed to show an alarming 
decline. 

The graph was widely circulated on social media. It is impressively
well produced. It’s an elegant small-multiple that, in addition
to the point ranges it identifies, also shows an error range (labeled 
as such for people who might not know what it is), and the story 
told across the panels for each country is pretty consistent. 


Figure 1.8: A crisis of faith in democracy?
(Source: Roberto Stefan Foa and Yascha Mounk,
"The Signs of Deconsolidation," Journal of
Democracy, 28(1), 5-16.) Panels: Sweden, Australia,
Netherlands, United States, New Zealand, Britain.
y-axis: percentage of people who say it is "essential"
to live in a democracy. x-axis: decade of birth, 1930s to 1980s.

One reason I chose this example is that, at the time
of writing, it is not unreasonable to be concerned 
about the stability of people's commitment to 
democratic government in some Western countries. 
Perhaps Mounk's argument is correct. But in such 
cases, the question is how much we are letting the 
data speak to us, as opposed to arranging it to say 
what we already think for other reasons. 


The figure is a little tricky to interpret. As the x-axis label says, 
the underlying data are from a cross-sectional survey of people of 
different ages rather than a longitudinal study measuring everyone 
at different times. Thus the lines do not show a trend measured 
each decade from the 1930s but rather differences in the answers 
given by people born in different decades, all of whom were asked 
the question at the same time. Given that, a bar graph might have
been a more appropriate way to display the results.

More important, as the story circulated, helped by the compelling
graphic, scholars who knew the World Values Survey data
underlying the graph noticed something else. The graph reads 
as though people were asked to say whether they thought it was 
essential to live in a democracy, and the results plotted show the 
percentage of respondents who said “Yes,” presumably in contrast 
to those who said “No.” But in fact the survey question asked 
respondents to rate the importance of living in a democracy on 
a ten-point scale, with 1 being “Not at all Important” and 10 being 
“Absolutely Important.” The graph showed the difference across 
ages of people who had given a score of 10 only, not changes in the 
average score on the question. As it turns out, while there is some 
variation by year of birth, most people in these countries tend to 
rate the importance of living in a democracy very highly, even if 
they do not all score it as “Absolutely Important.” The political scientist
Erik Voeten redrew the figure using the average response.
The results are shown in figure 1.9. 
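
The gap between the two measures is easy to see with a toy calculation. The responses below are invented for illustration and are not taken from the World Values Survey; they simply show how a large drop in the share of top scores can coexist with a small drop in the average.

```r
# Invented responses on the 1-10 importance scale for two birth cohorts.
# These numbers are illustrative only and do not come from any survey.
older <- rep(c(10, 9, 8), times = c(70, 20, 10))
younger <- rep(c(10, 9, 8), times = c(30, 40, 30))

# Measure 1: percentage of respondents who gave the top score of 10
share_top <- c(older = mean(older == 10), younger = mean(younger == 10)) * 100

# Measure 2: the average score
avg_score <- c(older = mean(older), younger = mean(younger))

print(share_top)
print(avg_score)
```

In this made-up case the share of tens falls from 70 percent to 30 percent, while the average moves only from 9.6 to 9.0. Which measure is plotted determines how alarming the picture looks.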


Figure 1.9: Perhaps the crisis has been overblown.
(Graph by Erik Voeten, based on WVS 5.)

The change here is not due to a difference in how the y-axis
is drawn. That is a common issue with graphs, and one we will 
discuss below. In this case both the New York Times graph and 
Voeten’s alternative have scales that cover the full range of possible
values (from 0 to 100% in the former case and from 1 to 10
in the latter). Rather, a different measure is being shown. We are 
now looking at the trend in the average score, rather than the trend 
for the highest possible answer. Substantively, there does still seem 
to be a decline in the average score by age cohort, on the order 
of between 0.5 point and 1.5 points on a 10-point scale. It could 
be an early warning sign of a collapse of belief in democracy, or it 
could be explained by something else. It might even be reasonable 
(as we will see for a different example shortly) to present the data 
in Voeten’s version with the y-axis covering just the range of the 
decline, rather than the full 1-10 scale. But it seems fair to say that
the story might not have made the New York Times if the original 
research article had presented Voeten’s version of the data rather 
than the one that appeared in the newspaper. 


Bad perception 

Our third category of badness lives in the gap between data and 
aesthetics. Visualizations encode numbers in lines, shapes, and 
colors. That means that our interpretation of these encodings is 
partly conditional on how we perceive geometric shapes and relationships
generally. We have known for a long time that poorly
encoded data can be misleading. Tufte (1983) contains many 
examples, as does Wainer (1984). Many of the instances they cite 
revolve around needlessly multiplying the number of dimensions 
shown in a plot. Using an area to represent a length, for example,
can make differences between observations look larger than
they are. 
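
The arithmetic behind the area problem is simple. As a minimal sketch with two hypothetical values, compare what happens when each value sets the height of a bar with what happens when it sets the side of a square:

```r
# Two hypothetical observations, the second twice the first
values <- c(a = 1, b = 2)

# Encoded as the height of a fixed-width bar, the visual ratio matches the data
bar_ratio <- unname(values["b"] / values["a"])

# Encoded as the side of a square (or the radius of a circle), the inked
# area grows with the square of the value
area_ratio <- unname(values["b"]^2 / values["a"]^2)

print(c(bar_ratio = bar_ratio, area_ratio = area_ratio))
```

A value that is twice as large ends up covering four times the area, so a viewer who reads the area will overstate the difference.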

Although the most egregious abuses are less common than
they once were, adding extra dimensions to plots remains a
common temptation. Figure 1.10, for instance, is a 3-D bar chart
made using a recent version of Microsoft Excel. Charts like this are
common in business presentations and popular journalism and are
also seen in academic journal articles from time to time. Here we
seek to avoid too much junk by using Excel’s default settings. As
you can see from the cells shown to the left of the chart, the data we
are trying to plot is not very complex. The chart even tries to help
us by drawing and labeling grid lines on the y- (and z-) axes. And
yet the 3-D columns in combination with the default angle of view
for the chart make the values as displayed differ substantially from
the ones actually in the cells. Each column appears to be somewhat
below its actual value. It is possible to see, if you squint with your
mind’s eye, how the columns would line up with the axis guidelines
if your angle of view moved so that the bars were head-on.
But as it stands, anyone asked what values the chart shows would
give the wrong answer.

To be fair, the 3-D format is not Excel's default type
of barchart.

Figure 1.10: A 3-D column chart created
in Microsoft Excel for Mac. Although it may
seem hard to believe, the values shown in the
bars are 1, 2, 3, and 4.

By now, many regular users of statistical graphics know 
enough to avoid excessive decorative embellishments of charts. 
They are also usually put on their guard by overly elaborate presentation
of simple trends, as when a three-dimensional ribbon
is used to display a simple line. Moreover, the default settings of 
most current graphical software tend to make the user work a little 
harder to add these features to plots. 

Even when the underlying numbers are sensible, the default
settings of software are good, and the presentation of charts is
mostly junk-free, some charts remain more difficult to interpret
than others. They encode data in ways that are hard for viewers to
understand. Figure 1.11 presents a stacked bar chart with time in
years on the x-axis and some value on the y-axis. The bars show
the total value, with subdivisions by the relative contribution of
different categories to each year’s observation. Charts like this are
common when showing the absolute contribution of various products
to total sales over time, for example, or the number of different
groups of people in a changing population. Equivalently, stacked
line-graphs showing similar kinds of trends are also common for
data with many observation points on the x-axis, such as quarterly
observations over a decade.

Figure 1.11: A junk-free plot that remains hard to
interpret. While a stacked bar chart makes the
overall trend clear, it can make it harder to see the
trends for the categories within the bar. This is partly
due to the nature of the trends. But if the additional
data is hard to understand, perhaps it should not be
included to begin with. (Legend: Type A, B, C, D.
x-axis: Year, 2004 to 2016.)

Figure 1.12: Aspect ratios affect our perception of
rates of change. (After an example by William S.
Cleveland.)

In a chart like this, the overall trend is readily interpretable, 
and it is also possible to easily follow the over-time pattern of the 
category that is closest to the x-axis baseline (in this case, type D, 
in purple). But the fortunes of the other categories are not so easily 
grasped. Comparisons of both the absolute and the relative share of 
type B or C are much more difficult, whether one wants to compare 
trends within type or between them. Relative comparisons need 
a stable baseline. In this case, that’s the x-axis, which is why the 
overall trend and the type D trend are much easier to see than any 
other trend. 
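
The baseline problem can be made concrete by computing where each segment of a stacked bar actually starts. The sketch below uses invented values for four categories, stacked with D at the bottom as in the figure; none of the numbers come from the chart itself.

```r
# Invented values for four categories over five years. Columns are in
# stacking order, from the bottom of the bar (D) to the top (A).
years <- c("2004", "2007", "2010", "2013", "2016")
values <- matrix(
  c(10, 12, 15, 19, 24,   # D
     8,  9,  7, 10,  8,   # C
     6,  5,  8,  6,  9,   # B
     4,  6,  5,  7,  6),  # A
  nrow = 5, dimnames = list(years, c("D", "C", "B", "A"))
)

# Lower edge of each segment: the running total of everything stacked beneath it
baseline <- t(apply(values, 1, function(v) cumsum(v) - v))
colnames(baseline) <- colnames(values)
print(baseline)

# How much each category's starting height moves around from year to year
baseline_range <- apply(baseline, 2, function(b) diff(range(b)))
print(baseline_range)
```

Only the bottom category starts from zero every year. Every other segment sits on a floor that shifts with the categories beneath it, which is why its trend is hard to read off the chart.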

A different sort of problem is shown in figure 1.12. In the left 
panel, the lines appear at first glance to be converging as the 
value of x increases. It seems like they might even intersect if 
we extended the graph out further. In the right panel, the curves 
are clearly equidistant from the beginning. The data plotted in each 
panel is the same, however. The apparent convergence in the left 
panel is just a result of the aspect ratio of the figure. 
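
One way to see why the aspect ratio matters is to work out the angle at which a line is physically drawn. This is a rough geometric sketch, not a reconstruction of Cleveland's example; the helper drawn_angle and the panel dimensions are assumptions made for illustration.

```r
# Angle (in degrees) at which a line appears on the page, given its slope in
# data units and the panel's physical size and axis ranges.
# drawn_angle() is an illustrative helper, not a function from any package.
drawn_angle <- function(slope, width, height, x_range, y_range) {
  x_units_per_inch <- x_range / width
  y_units_per_inch <- y_range / height
  # Physical slope: inches of rise per inch of horizontal run
  physical_slope <- slope * x_units_per_inch / y_units_per_inch
  atan(physical_slope) * 180 / pi
}

# The same line (slope 1 in data units, both axes spanning 10 units)
# drawn in three hypothetical panels
angle_wide <- drawn_angle(1, width = 8, height = 2, x_range = 10, y_range = 10)
angle_square <- drawn_angle(1, width = 4, height = 4, x_range = 10, y_range = 10)
angle_tall <- drawn_angle(1, width = 2, height = 8, x_range = 10, y_range = 10)

print(round(c(wide = angle_wide, square = angle_square, tall = angle_tall), 1))
```

The data are identical in all three cases, yet the line looks nearly flat in the wide panel and steep in the tall one. Judgments about rates of change, and about how far apart two curves are, depend on these drawn angles rather than on the numbers alone.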

These problems are not easily solved by the application of good
taste, or by following a general rule to maximize the data-to-ink 
ratio, even though that is a good rule to follow. Instead, we need to 
know a little more about the role of perception in the interpretation 
of graphs. Fortunately for us, this is an area that has produced a 
substantial amount of research over the past twenty-five years. 

1.3 Perception and Data Visualization 

While a detailed discussion of visual perception is well beyond 
the scope of this book, even a simple sense of how we see things 
will help us understand why some figures work and others do 
not. For a much more thorough treatment of these topics, Colin 
Ware’s books on information design are excellent overviews of 
research on visual perception, written from the perspective of people
designing graphs, figures, and systems for representing data
(Ware 2008, 2013). 


Edges, contrasts, and colors 



Figure 1.13: Hermann grid effect. 


Looking at pictures of data means looking at lines, shapes, and 
colors. Our visual system works in a way that makes some things 
easier for us to see than others. I am speaking in slightly vague 
terms here because the underlying details are the remit of vision 
science, and the exact mechanisms responsible are often the subject
of ongoing research. I will not pretend to summarize or evaluate
this material. In any case, independent of detailed explanation,
the existence of the perceptual phenomena themselves can often be 
directly demonstrated through visual effects or “optical illusions” 
of various kinds. These effects demonstrate that perception is not 
a simple matter of direct visual inputs producing straightforward 
mental representations of their content. Rather, our visual system 
is tuned to accomplish some tasks very well, and this comes at a 
cost in other ways. 

The active nature of perception has long been recognized. The
Hermann grid effect, shown in figure 1.13, was discovered in 1870.
Ghostly blobs seem to appear at the intersections in the grid but
only as long as one is not looking at them directly. A related effect
is shown in figure 1.14. These are Mach bands. When the gray bars
share a boundary, the apparent contrast between them appears to
increase. Speaking loosely, we can say that our visual system is trying
to construct a representation of what it is looking at based more
on relative differences in the luminance (or brightness) of the bars
than on their absolute value. Similarly, the ghostly blobs in
the Hermann grid effect can be thought of as a side-effect of the
visual system being tuned for a different task.

Figure 1.14: Mach bands. On the left side, five gray
bars are ordered from dark to light, with gaps
between them. On the right side, the bars have no
gap between them. The brightness or luminance of
the corresponding bars is the same. However, when
the bars touch, the dark areas seem darker and the
light areas lighter.

These sorts of effects extend to the role of background contrasts.
The same shade of gray will be perceived differently depending
on whether it is against a dark background or a light one. Our
ability to distinguish shades of brightness is not uniform either.
We are better at distinguishing dark shades than we are at distinguishing
light ones. The effects interact, too. We will do better at
distinguishing very light shades of gray when they are set against
a light background. When set against a dark background, differences
in the middle range of the light-to-dark spectrum are easier
to distinguish.

Our visual system is attracted to edges, and we assess contrast
and brightness in terms of relative rather than absolute values.
Some of the more spectacular visual effects exploit our mostly successful
efforts to construct representations of surfaces, shapes, and
objects based on what we are seeing. Edward Adelson’s checkershadow
illusion, shown in figure 1.15, is a good example. Though
hard to believe, the squares marked “A” and “B” are the same shade
of gray.

To figure out the shade of the squares on the floor, we compare
them to the nearby squares, and we also discount the shadows cast
by other objects. Even though a light-colored surface in shadow
might reflect less light than a dark surface in direct light, it would
generally be an error to infer that the surface in the shade really was
a darker color. The checkerboard image is carefully constructed to
exploit these visual inferences made based on local contrasts in
brightness and the information provided by shadows. As Adelson
(1995) notes, “The visual system is not very good at being a
physical light meter, but that is not its purpose.” Because it has
evolved to be good at perceiving real objects in its environment,
we need to be aware of how it works in settings where we are using
it to do other things, such as keying variables to some spectrum of
grayscale values.

Figure 1.16: Edge contrasts in monochrome and
color, after Ware (2008).

An important point about visual effects of this kind is that 
they are not illusions in the way that a magic trick is an illusion. 
If a magician takes you through an illusion step by step and shows 
you how it is accomplished, then the next time you watch the trick 
performed you will see through it and notice the various bits of 
misdirection and sleight of hand that are used to achieve the effect. 
But the most interesting visual effects are not like this. Even after 
they have been explained to you, you cannot stop seeing them, 
because the perceptual processes they exploit are not under your 
conscious control. This makes it easy to be misled by them, as 
when (for example) we overestimate the size of a contrast between 
two adjacent shaded areas on a map or grid simply because they 
share a boundary. 

Our ability to see edge contrasts is stronger for monochrome 
images than for color. Figure 1.16, from Ware (2008, 71), shows 
an image of dunes. In the red-green version, the structure of the 
landscape is hard to perceive. In the grayscale version, the dunes 
and ridges are much more easily visible. 

Using color in data visualization introduces a number of other 
complications (Zeileis & Hornik 2006). The central one is related 
to the relativity of luminance perception. As we have been discussing,
our perception of how bright something looks is largely
a matter of relative rather than absolute judgments. How bright a
surface looks depends partly on the brightness of objects near it.
In addition to luminance, the color of an object can be thought
of as having two other components. First, an object’s hue is what
we conventionally mean when we use the word “color”: red, blue,
green, purple, and so on. In physical terms it can be thought of
as the dominant wavelength of the light reflected from the object’s 
surface. The second component is chrominance or chroma. This is 
the intensity or vividness of the color. 

To produce color output on screens or in print we use various 
color models that mix together color components to get specific 
outputs. Using the RGB model, a computer might represent color 
in terms of mixtures of red, green, and blue components, each of 
which can take a range of values from 0 to 255. When using colors 
in a graph, we are mapping some quantity or category in our data 
to a color that people see. We want that mapping to be “accurate” in 
some sense, with respect to the data. This is partly a matter of the 
mapping being correct in strictly numerical terms. For instance, 
we want the gap between two numerical values in the data to be 
meaningfully preserved in the numerical values used to define the 
colors shown. But it is also partly a matter of how that mapping 
will be perceived when we look at the graph. 

For example, imagine we had a variable that could take values
from 0 to 5 in increments of 1, with zero being the lowest
value. It is straightforward to map this variable to a set of RGB colors
that are equally distant from one another in purely numerical
terms in our color space. The wrinkle is that many points that are 
equidistant from each other in this sense will not be perceived as 
equally distant by people looking at the graph. This is because our 
perception is not uniform across the space of possible colors. For 
instance, the range of chroma we are able to see depends strongly 
on luminance. If we pick the wrong color palette to represent our 
data, for any particular gradient the same-sized jump between one 
value and another (e.g., from 0 to 1, as compared to from 3 to 4) 
might be perceived differently by the viewer. This also varies across 
colors, in that numerically equal gaps between a sequence of
reds (say) are perceived differently from the same gaps mapped to 
blues. 
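
A rough way to see this is to take two RGB ramps with identical numerical steps and convert them to a color space designed to track perception. The sketch below uses the convertColor function from base R's grDevices package and treats CIE L* (lightness) as a stand-in for how bright each color looks. That is a simplification, since lightness is only one component of perceived color difference, and the choice of a red ramp and a blue ramp is purely illustrative.

```r
# Six equally spaced channel values, one per level of a variable coded 0 to 5
channel <- seq(0, 1, length.out = 6)

# Two ramps with identical numerical steps: black to pure red, black to pure blue
reds <- cbind(R = channel, G = 0, B = 0)
blues <- cbind(R = 0, G = 0, B = channel)

# CIE L* is the first column of the Lab representation. It is used here as a
# rough proxy for perceived lightness, not as a full measure of color difference.
lightness <- function(rgb) {
  grDevices::convertColor(rgb, from = "sRGB", to = "Lab")[, 1]
}

# Change in L* for each one-unit step in the data, by ramp
steps <- rbind(red = diff(lightness(reds)), blue = diff(lightness(blues)))
print(round(steps, 1))
```

The steps are equal in RGB terms, but the lightness changes they produce are not equal within a ramp, and the red ramp covers a wider range of lightness than the blue one.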

When choosing color schemes, we will want mappings from
data to color that are not just numerically but also perceptually
uniform. R provides color models and color spaces that try to achieve
this. Figure 1.17 shows a series of sequential gradients using the
HCL (hue-chroma-luminance) color model. The grayscale gradient
at the top varies by luminance only. The blue palette varies by
luminance and chrominance, as the brightness and the intensity of
the color vary across the spectrum. The remaining three palettes
vary by luminance, chrominance, and hue. The goal in each case is
to generate a perceptually uniform scheme, where hops from one
level to the next are seen as having the same magnitude.

Figure 1.17: Five palettes generated from R's color
space library. From top to bottom, the sequential
grayscale palette varies only in luminance, or
brightness. The sequential blue palette varies in
both luminance and chrominance (or intensity).
The third sequential palette varies in luminance,
chrominance, and hue. The fourth palette is
diverging, with a neutral midpoint. The fifth features
balanced hues, suitable for unordered categories.
Panels: Sequential grayscale; Sequential blue to gray;
Sequential terrain; Diverging; Unordered hues.
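
As a minimal sketch of how palettes of these five kinds can be generated, the code below uses hcl.colors() from base R (available in R 3.6.0 and later). The palette names are our own choices from the list returned by hcl.pals(). They are assumptions picked to resemble the five rows of figure 1.17, not necessarily the palettes used to draw it.

```r
# Number of levels in each palette.
n <- 6

# HCL-based palettes that ship with base R (see hcl.pals() for the full list).
palettes <- list(
  grayscale = hcl.colors(n, palette = "Grays"),     # luminance only
  blues     = hcl.colors(n, palette = "Blues 3"),   # luminance and chroma
  terrain   = hcl.colors(n, palette = "Terrain"),   # luminance, chroma, hue
  diverging = hcl.colors(n, palette = "Blue-Red"),  # neutral midpoint
  unordered = hcl.colors(n, palette = "Dark 3")     # qualitative hues
)

# Each element is a character vector of hex color codes.
print(palettes)
```

To look at a palette rather than its hex codes, something like barplot(rep(1, n), col = palettes$blues, border = NA, space = 0, axes = FALSE) draws a simple row of swatches.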

Gradients or sequential scales from low to high are one of
three sorts of color palettes. When we are representing a scale
with a neutral midpoint (as when we are showing temperatures,
for instance, or variance in either direction from a zero point or a
mean value), we want a diverging scale, where the steps away from
the midpoint are perceptually even in both directions. The
blue-to-red palette in figure 1.17 displays an example. Finally, perceptual
uniformity matters for unordered categorical variables as well. We
often use color to represent data for different countries, or political
parties, or types of people, and so on. In those cases we want
the colors in our qualitative palette to be easily distinguishable but
also have the same valence for the viewer. Unless we are doing it
deliberately, we do not want one color to perceptually dominate
the others. The bottom palette in figure 1.17 shows an example of
a qualitative palette that is perceptually uniform in this way.

The upshot is that we should generally not pick colors in an ad
hoc way. It is too easy to go astray. In addition to the considerations
we have been discussing, we also want to avoid producing plots
that confuse people who are color-blind, for example. Fortunately,
almost all the work has been done for us already. Different color
spaces have been defined and standardized in ways that account
for these uneven or nonlinear aspects of human color perception.
(The body responsible for this is the appropriately
authoritative-sounding Commission Internationale de l'Éclairage,
or International Commission on Illumination.) R and ggplot make
these features available to us for free. The default palettes we
will be using in ggplot are perceptually uniform in the right way.
If we want to get more adventurous later,
the tools are available to produce custom palettes that still have
desirable perceptual qualities. Our decisions about color will focus
more on when and how it should be used. As we are about to
see, color is a powerful channel for picking out visual elements of
interest.


Preattentive search and what "pops" 

Some objects in our visual field are easier to see than others. They
pop out at us from whatever they are surrounded by. For some
kinds of object, or through particular channels, this can happen
very quickly. Indeed, from our point of view it happens before or
almost before the conscious act of looking at or for something.
The general term for this is “preattentive pop-out,” and there is
an extensive experimental and theoretical literature on it in psychology
and vision science. As with the other perceptual processes
we have been discussing, the explanation for what is happening is
or has been a matter of debate, up to and including the degree to
which the phenomenon really is “preattentive,” as discussed, for
example, by Treisman & Gormican (1988) or Nakayama & Joseph
(1998). But it is the existence of pop-out that is relevant to us,
rather than its explanation. Pop-out makes some things on a data
graphic easier to see or find than others.

Figure 1.18: Searching for the blue circle becomes progressively harder.

Consider the panels in figure 1.18. Each one of them contains
a single blue circle. Think of it as an observation of interest. Reading
left to right, the first panel contains twenty circles, nineteen
of which are yellow and one blue. The blue circle is easy to find,
as there are a relatively small number of observations to scan, and
their color is the only thing that varies. The viewer barely has to
search consciously at all before seeing the dot of interest.

In the second panel, the search is harder, but not that much
harder. There are a hundred dots now, five times as many, but again
the blue dot is easily found. The third panel again has only twenty
observations. But this time there is no variation in color. Instead
nineteen observations are triangles and one is a circle. On average,
looking for the blue dot is noticeably harder than searching for it in
the first panel, and it may even be more difficult than in the second
panel despite there being many fewer observations.

Think of shape and color as two distinct channels that can be
used to encode information visually. It seems that pop-out on the
color channel is stronger than it is on the shape channel. In the
fourth panel, the number of observations is again upped to one
hundred. Finding the single blue dot may take noticeably longer.
If you don’t see it on the first or second pass, it may require a
conscious effort to systematically scan the area in order to find it.
It seems that search performance on the shape channel degrades
much faster than on the color channel.

Finally the fifth panel mixes color and shape for a large number
of observations. Again there is only one blue dot on the graph,
but annoyingly there are many blue triangles and yellow dots that
make it harder to find what we are looking for. Dual- or
multiple-channel searches for large numbers of observations can be very
slow.

Similar effects can be demonstrated for search across other
channels (for instance, with size, angle, elongation, and movement)
and for particular kinds of searches within channels. For
example, some kinds of angle contrasts are easier to see than others,
as are some kinds of color contrasts. Ware (2008, 27-33) has
more discussion and examples. The consequences for data visualization
are clear enough. As shown in figure 1.19, adding multiple
channels to a graph is likely to quickly overtax the capacity of
the viewer. Even if our software allows us to, we should think
carefully before representing different variables and their values
by shape, color, and position all at once. It is possible for there
to be exceptions, in particular (as shown in the second panel of
figure 1.19) if the data shows a great deal of structure to begin with.
But even here, in all but the most straightforward cases a different
visualization strategy is likely to do better.

Figure 1.19: Multiple channels become uninterpretable very fast (left), unless your data has a great deal of structure (right).


Gestalt rules


At first glance, the points in the pop-out examples in figure 1.18
might seem randomly distributed within each panel. In fact, they
are not quite randomly located. Instead, I wrote a little code to lay
them out in a way that spread them around the plotting area but
prevented any two points from completely or partially overlapping
each other. I did this because I wanted the scatterplots to be
programmatically generated but did not want to take the risk that the
blue dot would end up plotted underneath one of the other dots
or triangles. It’s worth taking a closer look at this case, as there is a
lesson here for how we perceive patterns.

Each panel in figure 1.20 shows a field of points. There are
clearly differences in structure between them. The first panel
was produced by a two-dimensional Poisson point process and is
“properly” random. (Defining randomness, or ensuring that a process
really is random, turns out to be a lot harder than you might
think. But we gloss over those difficulties here.) The second panel
was produced from a Matérn model, a specification often found in
spatial statistics and ecology. In a model like this points are again
randomly distributed but are subject to some local constraints.
In this case, after randomly generating a number of candidate
points in order, the field is pruned to eliminate any point that
appears too close to a point that was generated before it. We can
tune the model to decide how close is “too close.” The result is a
set of points that are evenly spread across the available space.
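
The pruning rule just described can be sketched in a few lines of base R. This is not the code used to draw figure 1.20. The intensity (200 points expected on a unit square), the distance threshold, and the seed are all assumed values chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

lambda   <- 200    # assumed expected number of candidate points
min_dist <- 0.05   # assumed threshold for "too close"

# Poisson field: a Poisson-distributed number of points, each placed
# uniformly at random on the unit square. Row order is generation order.
n_candidates <- rpois(1, lambda)
candidates <- cbind(x = runif(n_candidates), y = runif(n_candidates))

# Pruning rule: drop any point that lies within min_dist of a point
# generated before it (whether or not that earlier point survives).
d <- as.matrix(dist(candidates))
keep <- vapply(seq_len(n_candidates), function(i) {
  all(d[i, seq_len(i - 1)] >= min_dist)
}, logical(1))

thinned <- candidates[keep, , drop = FALSE]

# How many points survive the pruning?
c(candidates = nrow(candidates), retained = nrow(thinned))
```

Plotting candidates and thinned side by side, for example with plot(candidates, asp = 1) and plot(thinned, asp = 1), gives a contrast of the kind shown in the figure: the unpruned field is clumpy, while the pruned one is more evenly spread.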

If you ask people which of these panels has more structure in
it, they will tend to say the Poisson field. We associate randomness
with a relatively even distribution across a space. But in fact, a random
process like this is substantially more clumpy than we tend to
think. I first saw a picture of this contrast in an essay by Stephen
Jay Gould (1991). There the Matérn-like model was used as a representation
of glowworms on the wall of a cave in New Zealand. It’s
a good model for that case because if one glowworm gets too close 
to another, it’s liable to get eaten. Hence the relatively even—but 
not random—distribution that results. 


Figure 1.20: Each panel shows simulated data.
The upper panel (Poisson) shows a random point pattern
generated by a Poisson process. The lower panel (Matérn)
is from a Matérn model, where new points are
randomly placed but cannot be too near already-existing
ones. Most people see the Poisson-generated
pattern as having more structure, or less
"randomness," than the Matérn, whereas the reverse
is true.

Figure 1.21: Gestalt inferences: proximity, similarity,
connection, common fate. The layout of the figure
employs some of these principles, in addition to
displaying them.

We look for structure all the time. We are so good at it that we
will find it in random data, given time. (This is one of the reasons
that data visualization can hardly be a replacement for statistical
modeling.) The strong inferences we make about relationships
between visual elements from relatively sparse visual information
are called “gestalt rules.” They are not pure perceptual effects like
the checkerboard illusions. Rather, they describe our tendency to
infer relationships between the objects we are looking at in a way
that goes beyond what is strictly visible. Figure 1.21 provides some
examples.

What sorts of relationships are inferred, and under what circumstances?
In general we want to identify groupings, classifications,
or entities that can be treated as the same thing or part of
the same thing:

• Proximity: Things that are spatially near to one another seem
to be related.

• Similarity: Things that look alike seem to be related.

• Connection: Things that are visually tied to one another seem
to be related.

• Continuity: Partially hidden objects are completed into familiar
shapes.

• Closure: Incomplete shapes are perceived as complete.

• Figure and ground: Visual elements are taken to be either in the
foreground or in the background.

• Common fate: Elements sharing a direction of movement are
perceived as a unit.











Some kinds of visual cues outweigh others. For example, in 
the upper left of figure 1.21, the circles are aligned horizontally 
into rows, but their proximity by column takes priority, and we see 
three groups of circles. In the upper right, the three groups are still 
salient but the row of blue circles is now seen as a grouped entity. In 
the middle row of the figure, the left side shows mixed grouping by 
shape, size, and color. Meanwhile the right side of the row shows 
that direct connection outweighs shape. Finally the two schematic 
plots in the bottom row illustrate both connection and common 
fate, in that the lines joining the shapes tend to be read left-to- 
right as part of a series. Note also the points in the lower right plot 
where the lines cross. There are gaps in the line segments joining 
the circles, but we perceive this as them “passing underneath” the 
lines joining the triangles. 


1.4 Visual Tasks and Decoding Graphs 

The workings of our visual system and our tendency to make inferences
about relationships between visible elements form the basis
of our ability to interpret graphs of data. There is more involved
besides that, however. Beyond core matters of perception lies the
question of interpreting and understanding particular kinds of
graphs. The proportion of people who can read and correctly interpret
a scatterplot is lower than you might think. At the intersection
of perception and interpretation there are specific visual
tasks that people need to perform in order to properly see the
graph in front of them. To understand a scatterplot, for example,
the viewer needs to know a lot of general information, such as
what a variable is, what the x-y coordinate plane looks like, why
we might want to compare two variables on it, and the convention
of putting the supposed cause or “independent” variable on
the x-axis. Even if viewers understand all these things, they must
still perform the visual task of interpreting the graph. A scatterplot
is a visual representation of data, not a way to magically transmit
pure understanding. Even well-informed viewers may do worse
than we think when connecting the picture to the underlying data
(Doherty, et al. 2007; Rensink & Baldridge 2010).

In the 1980s William S. Cleveland and Robert McGill conducted
some experiments identifying and ranking these tasks for
different types of graphics (Cleveland & McGill, 1984, 1987). Most
often, research subjects were asked to estimate two values within a
chart (e.g., two bars in a bar chart, or two slices of a pie chart) or
compare values between charts (e.g., two areas in adjacent stacked
bar charts). Cleveland went on to apply the results of this work,
developing the trellis display system for data visualization in S, the
statistical programming language developed at Bell Labs. (R is a
later implementation of S.) He also wrote two excellent books that
describe and apply these principles (Cleveland 1993, 1994).

Figure 1.22: Schematic representation of basic perceptual tasks for nine chart types, by Heer and Bostock, following Cleveland and McGill. In both
studies, participants were asked to make comparisons of highlighted portions of each chart type and say which was smaller.

In 2010 Heer & Bostock replicated Cleveland’s earlier experiments
and added a few assessments, including evaluations of
rectangular-area graphs, which have become more popular in
recent years. These include treemaps, where a square or rectangle
is subdivided into further rectangular areas representing some
proportion or percentage of the total. It looks a little like a stacked
bar chart with more than one column. The comparisons and graph
types made by their research subjects are shown schematically in
figure 1.22. For each graph type, subjects were asked to identify
the smaller of two marked segments on the chart and then to
“make a quick visual judgment” estimating what percentage the
smaller one was of the larger. As can be seen from the figure, the
charts tested encoded data in different ways. Types 1-3 use position
encoding along a common scale while types 4 and 5 use length
encoding. The pie chart encodes values as angles, and the remaining
charts as areas, using either circular areas, separate rectangles
(as in a cartogram), or subrectangles (as in a treemap).

Their results are shown in figure 1.23, along with Cleveland
and McGill’s original results for comparison. The replication was
quite good. The overall pattern of results seems clear, with performance
worsening substantially as we move away from comparison
on a common scale to length-based comparisons to angles
and finally areas. Area comparisons perform even worse than the
(justifiably) maligned pie chart.

Figure 1.23: Cleveland and McGill's original results
(top) and Heer and Bostock's replication with ...

These findings, and other work in this tradition, strongly suggest
that there are better and worse ways of visually representing
data when the task the user must perform involves estimating and
comparing values within the graph. Think of this as a “decoding”
operation that the viewer must perform in order to understand
the content. The data values were encoded or mapped into the
graph, and now we have to get them back out again. When doing
this, we do best judging the relative position of elements aligned
on a common scale, as, for example, when we compare the heights
of bars on a bar chart, or the position of dots with reference to a
fixed x- or y-axis. When elements are not aligned but still share a
scale, comparison is a little harder but still pretty good. It is more
difficult again to compare the lengths of lines without a common
baseline.

Outside of position and length encodings, things generally
become harder and the decoding process is more error prone. We
tend to misjudge quantities encoded as angles. The size of acute
angles tends to be underestimated, and the size of obtuse angles
overestimated. This is one reason pie charts are usually a bad idea.
We also judge areas poorly. We have known for a long time that
area-based comparisons of quantities are easily misinterpreted or
exaggerated. For example, values in the data might be encoded as
lengths, which are then squared to make the shape on the graph.
The result is that the difference in area between the squares or
rectangles will be much larger than the difference between the two
numbers they represent.

Comparing the areas of circles is prone to more error again, for
the same reason. It is possible to offset these problems somewhat
by choosing a more sophisticated method for encoding the data
as an area. Instead of letting the data value be the length of the
side of a square or the radius of the circle, for example, we could
map the value directly to area and back-calculate the side length or
radius. Still, the result will generally not be as good as alternatives.
These problems are further compounded for “three-dimensional”
shapes like blocks, cylinders, or spheres, which appear to represent
volumes. And as we saw with the 3-D bar chart in figure 1.10, the
perspective or implied viewing angle that accompanies these kinds
of charts creates other problems when it comes to reading the scale
on a y-axis.
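
The arithmetic behind the two ways of encoding a value as a circle is easy to work through. In the sketch below the three data values are hypothetical, chosen so that each is double the one before. In the first encoding the value is used as the radius. In the second the value is used as the area, and the radius is back-calculated from it.

```r
values <- c(1, 2, 4)  # hypothetical data values

# Encoding 1: the value is the radius, so area grows with the square.
radius_naive <- values
area_naive   <- pi * radius_naive^2

# Encoding 2: the value is the area; back-calculate the radius.
radius_area <- sqrt(values / pi)
area_area   <- pi * radius_area^2

# Area of each circle relative to the smallest one.
data.frame(
  value            = values,
  radius_encoding  = area_naive / area_naive[1],
  area_encoding    = area_area / area_area[1]
)
```

With the value as radius, doubling the data quadruples the area on the page. With the value as area, the drawn areas stay in proportion to the data. In ggplot2 the size scales for points handle this choice: scale_size_area() maps values to area, while scale_radius() maps them to radius.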

Finally, we find it hard to judge changes in slope. The estimation
of rates of change in lines or trends is strongly conditioned
by the aspect ratio of the graph, as we saw in figure 1.12. Our
relatively weak judgment of slopes also interacts badly with
three-dimensional representations of data. Our ability to scan the “away”
dimension of depth (along the z-axis) is weaker than our ability to
scan the x- and y-axes. For this reason, it can be disproportionately
difficult to interpret data displays of point clouds or surfaces
displayed with three axes. They can look impressive, but they are
also harder to grasp.


1.5 Channels for Representing Data 

Graphical elements represent our data in ways that we can see.
Different sorts of variables can be represented more or less
well by different kinds of visual marks or representations, such
as points, lines, shapes, and colors. Our task is to come up with
methods that encode or map variables in the right way. As we do
this, we face several constraints. First, the channel or mapping that
we choose needs to be capable of representing the kind of data that
we have. If we want to pick out unordered categories, for example,
choosing a continuous gradient to represent them will not make
much sense. If our variable is continuous, it will not be helpful to
represent it as a series of shapes.

Second, given that the data can be comprehensibly represented
by the visual element we choose, we will want to know how
effective that representation is. This was the goal of Cleveland’s
research. Following Tamara Munzner (2014, 101-3), figures 1.24
and 1.25 present an approximate ranking of the effectiveness of
different channels for ordered and unordered data, respectively. If
we have ordered data and we want the viewer to efficiently make
comparisons, then we should try to encode it as a position on
a common scale. Encoding numbers as lengths (absent a scale)
works too, but not as effectively. Encoding them as areas will make
comparisons less accurate again, and so on.

Third, the effectiveness of our graphics will depend not just on
the channel that we choose but on the perceptual details of how we
implement it. So, if we have a measure with four categories ordered
from lowest to highest, we might correctly decide to represent it
using a sequence of colors. But if we pick the wrong sequence,
the data will still be hard to interpret, or actively misleading. In
a similar way, if we pick a bad set of hues for an unordered categorical
variable, the result might not just be unpleasant to look at
but actively misleading.

Finally, bear in mind that these different channels or mappings
for data are not in themselves kinds of graphs. They are just the elements
or building blocks for graphs. When we choose to encode
a variable as a position, a length, an area, a shade of gray, or a
color, we have made an important decision that narrows down
what the resulting plot can look like. But this is not the same as
deciding what type of plot it will be, in the sense of choosing
whether to make a dotplot or a bar chart, a histogram or a frequency
polygon, and so on.


Figure 1.24: Channels for mapping ordered data
(continuous or other quantitative measures),
arranged top to bottom from more to less effective,
after Munzner (2014, 102).

Figure 1.25: Channels for mapping unordered
categorical data, arranged top-to-bottom from more
to less effective, after Munzner (2014, 102).


1.6 Problems of Honesty and Good Judgment

Figure 1.26 shows two ways of redrawing our life expectancy figure
(fig. 1.4). Each of these plots is far less noisy than the junk-filled
monstrosity we began with. But they also have design features that
could be argued over and might even matter substantively depending
on the circumstances. For example, consider the scales on the
x-axis in each case. The left-hand panel in figure 1.26 is a bar chart,
and the length of the bar represents the value of the variable “average
life expectancy in 2007” for each continent. The scale starts at
zero and extends to just beyond the level of the largest value. Meanwhile
the right-hand panel is a Cleveland dotplot. Each observation
is represented by a point, and the scale is restricted to the range
of the data as shown.

Figure 1.26: Two simpler versions of our junk chart.
The scale on the bar chart version goes to zero,
while the scale on the dotplot version is confined to
the range of values taken by the observations.

It is tempting to lay down inflexible rules about what to do in
terms of producing your graphs, and to dismiss people who don’t
follow them as producing junk charts or lying with statistics. But
being honest with your data is a bigger problem than can be solved
by rules of thumb about making graphs. In this case there is a moderate
level of agreement that bar charts should generally include a
zero baseline (or equivalent) given that bars make lengths salient
to the viewer. But it would be a mistake to think that a dotplot
was by the same token deliberately misleading, just because it kept
itself to the range of the data instead.

Which one is to be preferred? It is tricky to give an unequivocal
answer, because the reasons for preferring one type of scaling over
another depend in part on how often people actively try to mislead
others by preferring one sort of representation over another.
On the one hand, there is a lot to be said in favor of showing the
data over the range we observe it, rather than forcing every scale
to encompass its lowest and highest theoretical value. Many otherwise
informative visualizations would become useless if it was
mandatory to include a zero point on the x- or y-axis. On the
other hand, it’s also true that people sometimes go out of their
way to restrict the scales they display in a way that makes their
argument look better. Sometimes this is done out of active malice,
other times out of passive bias, or even just a hopeful desire to
see what you want to see in the data. (Remember, often the main
audience for your visualizations is yourself.) In those cases, the
resulting graphic will indeed be misleading.

Rushed, garish, and deliberately inflammatory or misleading
graphics are a staple of social media sharing and the cable news
cycle. But the problem comes up in everyday practice as well, and
the two can intersect if your work ends up in front of a public
audience. For example, let’s take a look at some historical data
on law school enrollments. A decline in enrollments led to some
reporting on trends since the early 1970s. The results are shown in
figure 1.27.

The first panel shows the trend in the number of students 
beginning law school each year since 1973. The y-axis starts from 
just below the lowest value in the series. The second panel shows 
the same data but with the y-axis minimum set to zero instead. The 
columnist and writer Justin Fox saw the first version and remarked 
on how amazing it was. He was then quite surprised at the strong 
reactions he got from people who insisted the y-axis should have 
included zero. The original chart was “possibly... one of the worst 
represented charts I’ve ever seen,” said one interlocutor. Another 
remarked that “graphs that don’t go to zero are a thought crime” 
(Fox 2014). 

My own view is that the chart without the zero baseline shows
you that, after almost forty years of mostly rising enrollments, law
school enrollments dropped suddenly and precipitously around
2011 to levels not seen since the early 1970s. The levels are clearly
labeled, and the decline does look substantively surprising and significant.
In a well-constructed chart the axis labels are a necessary
guide to the reader, and we should expect readers to pay attention
to them. The chart with the zero baseline, meanwhile, does
not add much additional information beyond reminding you, at
the cost of wasting some space, that 35,000 is a number quite a lot
larger than zero.

That said, I am sympathetic to people who got upset at the first 
chart. At a minimum, it shows they know to read the axis labels on 
a graph. That is less common than you might think. It likely also 
shows they know interfering with the axes is one way to make a 
chart misleading, and that it is not unusual for that sort of thing to 
be done deliberately. 




Figure 1.27: Two views of the rapid decline in law 
school enrollments in the mid-2010s. 


1.7 Think Clearly about Graphs 

I am going to assume that your goal is to draw effective graphs
in an honest and reproducible way. Default settings and general
rules of good practice have limited powers to stop you from doing
the wrong thing. But one thing they can do is provide not just
tools for making graphs but also a framework or set of concepts
that helps you think more clearly about the good work you want
to produce. When learning a graphing system or toolkit, people
often start thinking about specific ways they want their graph to
look. They quickly start formulating requests. They want to know
how to make a particular kind of chart, or how to change the typeface
for the whole graph, or how to adjust the scales, or how to
move the title, customize the labels, or change the colors of the
points.

These requests involve different features of the graph. Some
have to do with basic features of the figure’s structure, with which
bits of data are encoded as or mapped to elements such as shape,
line, or color. Some have to do with the details of how those elements
are represented. If a variable is mapped to shape, which
shapes will be chosen, exactly? If another variable is represented
by color, which colors in particular will be used? Some have to do
with the framing or guiding features of the graph. If there are
tick-marks on the x-axis, can I decide where they should be drawn? If
the chart has a legend, will it appear to the right of the graph or on
top? If data points have information encoded in both shape and
color, do we need a separate legend for each encoding, or can we
combine them into a single unified legend? And some have to do
with thematic features of the graph that may greatly affect how the
final result looks but are not logically connected to the structure of
the data being represented. Can I change the title font from Times
New Roman to Helvetica? Can I have a light blue background in
all my graphs?

A real strength of ggplot is that it implements a grammar
of graphics to organize and make sense of these different elements
(Wilkinson 2005). Instead of a huge, conceptually flat list
of options for setting every aspect of a plot’s appearance at once,
ggplot breaks up the task of making a graph into a series of distinct
tasks, each bearing a well-defined relationship to the structure of
the plot. When you write your code, you carry out each task using a
function that controls that part of the job. At the beginning, ggplot
will do most of the work for you. Only two steps are required. First,
you must give some information to the ggplot() function. This
establishes the core of the plot by saying what data you are using
and what variables will be linked or mapped to features of the plot.
Second, you must choose a geom_ function. This decides what sort
of plot will be drawn, such as a scatterplot, a bar chart, or a boxplot.
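
As a first glimpse of what these two steps look like in code, here is a minimal sketch. It assumes the ggplot2 package is installed, and it uses mtcars, a small data set that ships with R, purely as a stand-in. The choice of variables (car weight on the x-axis, miles per gallon on the y-axis) is ours for illustration only.

```r
library(ggplot2)

# Step 1: tell ggplot() what data we are using and which variables
# are mapped to which features of the plot. The mtcars data set is
# an assumed stand-in, not data from this chapter.
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg))

# Step 2: choose a geom_ function to decide what sort of plot is
# drawn. Here geom_point() makes a scatterplot.
p_scatter <- p + geom_point()

# Typing the object's name at the console, or calling print() on it,
# draws the plot.
```

Swapping geom_point() for a different geom_ function changes the kind of plot while leaving the first step, the data and the mappings, untouched.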

As you progress, you will gradually use other functions to gain
more fine-grained control over other features of the plot, such as
scales, legends, and thematic elements. This also means that, as
you learn ggplot, it is very important to grasp the core steps first,
before worrying about adjustments and polishing. And so that is
how we’ll proceed. In the next chapter we will learn how to get
up and running in R and make our first graphs. From there, we
will work through examples that introduce each element of ggplot’s
way of doing things. We will be producing sophisticated plots quite
quickly, and we will keep working on them until we are in full control
of what we are doing. As we go, we will learn about some ideas
and associated techniques and tricks to make R do what we want.


1.8 Where to Go Next 

For an entertaining and informative overview of various visual
effects and optical “illusions,” take a look at Michael Bach’s website
at michaelbach.de. If you would like to learn more about the
relationship between perception and data visualization, follow up
on some of the references in this chapter. Munzner (2014), Ware
(2008), and Few (2009) are good places to start. William Cleveland’s
books (1993, 1994) are models of clarity and good advice. As
we shall see beginning in the next chapter, the ideas developed in
Wilkinson (2005) are at the heart of ggplot’s approach to visualization.
Finally, foundational work by Bertin (2010) lies behind a lot
of thinking on the relationship between data and visual elements.



2 Get Started 


In this chapter, we will begin to learn how to create pictures of data 
that people, including ourselves, can look at and learn from. R and 
ggplot are the tools we will use. The best way to learn them is to 
follow along and repeatedly write code as you go. The material in 
this book is designed to be interactive and hands-on. If you work 
through it with me using the approach described below, you will 
end up with a book much like this one, with many code samples 
alongside your notes, and the figures or other output produced by 
that code shown nearby. 

I strongly encourage you to type out your code rather than 
copying and pasting the examples from the text. Typing it out will 
help you learn it. At the beginning it may feel like tedious 
transcription you don’t fully understand. But it slows you down in a 
way that gets you used to what the syntax and structure of R are 
like and is a very effective way to learn the language. It’s especially 
useful for ggplot, where the code for our figures will repeatedly 
have a similar structure, built up piece by piece. 


2.1 Work in Plain Text, Using RMarkdown 

When taking notes, and when writing your own code, you should 
write plain text in a text editor. Do not use Microsoft Word or some 
other word processor. You may be used to thinking of your final 
outputs (e.g., a Word file, a PDF document, presentation slides, or 
the tables and figures you make) as what’s “real” about your project. 
Instead, it’s better to think of the data and code as what’s real, 
together with the text you write. The idea is that all your finished 
output—your figures, tables, text, and so on—can be procedurally 
and reproducibly generated from code, data, and written material 
stored in a simple, plain-text format. 

The ability to reproduce your work in this way is important 
to the scientific process. But you should also see it as a pragmatic 
choice that will make life easier for you in the future. The reality 
for most of us is that the person who will most want to easily 
reproduce your work is you, six months or a year from now. This 
is especially true for graphics and figures. These often have a 
“finished” quality to them, as a result of much tweaking and 
adjustments to the details of the figure. That can make them hard to 
reproduce later. While it is normal for graphics to undergo a 
substantial amount of polishing on their way to publication, our goal 
is to do as much of this as possible programmatically, in code 
we write, rather than in a way that is retrospectively invisible, as, 
for example, when we edit an image in an application like Adobe 
Illustrator. 

While learning ggplot, and later while doing data analysis, you 
will find yourself constantly pinging back and forth between three 
things: 

1. Writing code. You will write a lot of code to produce plots. You 
will also write code to load your data and to look quickly at 
tables of that data. Sometimes you will want to summarize, 
rearrange, subset, or augment your data, or run a statistical 
model with it. You will want to be able to write that code as 
easily and effectively as possible. 

2. Looking at output. Your code is a set of instructions that, when 
executed, produces the output you want: a table, a model, or 
a figure. It is often helpful to be able to see that output and its 
partial results. While we’re working, it’s also useful to keep the 
code and the things produced by the code close together, if 
we can. 

3. Taking notes. You will also be writing about what you are doing 
and what your results mean. When learning how to do something 
in ggplot, for instance, you will want to make notes to 
yourself about what you did, why you wrote it this way rather 
than that, or what this new concept, function, or instruction 
does. Later, when doing data analysis and making figures, you 
will be writing up reports or drafting papers. 

How can you do all this effectively? The simplest way to keep 
code and notes together is to write your code and intersperse it 
with comments. All programming languages have some way of 
demarcating lines as comments, usually by putting a special 
character (like #) at the start of the line. We could create a plain-text 
script file called, e.g., notes.r, containing code and our comments 
on it. This is fine as far as it goes. But except for very short files, it 
will be difficult to do anything useful with the comments we write. 
If we want a report from an analysis, for example, we will have to 
write it up separately. While a script file can keep comments and 
code together, it loses the connection between code and its 
output, such as the figure we want to produce. But there is a better 
alternative: we can write our notes using RMarkdown. 

Figure 2.1: Top: Some elements of RMarkdown syntax. Bottom: From a 
plain text RMarkdown file to PDF output. 

The format is language agnostic and can be used with, e.g., Python 
and other languages. 

An RMarkdown file is just a plain text document where text 
(such as notes or discussion) is interspersed with pieces, or chunks, 
of R code. When you feed the document to R, it knits this file into a 
new document by running the R code piece by piece, in sequence, 
and either supplementing or replacing the chunks of code with 
their output. The resulting file is then converted into a more 
readable document formatted in HTML, PDF, or Word. The noncode 
segments of the document are plain text, but they can have simple 
formatting instructions in them. These are set using Markdown, a 
set of conventions for marking up plain text in a way that indicates 
how it should be formatted. The basic elements of Markdown are 
shown in the upper part of figure 2.1. When you create a 
markdown document in RStudio, it contains some sample text to get 
you started. 

RMarkdown documents look like the one shown schematically 
in the lower part of figure 2.1. Your notes or text, with Markdown 
formatting as needed, are interspersed with code. There is a 
set format for code chunks. They look like this: 


```{r}

```


Three backticks (on a U.S. keyboard, that’s the character under the 
escape key) are followed by a pair of curly braces containing the 
name of the language we are using. The backticks-and-braces part 
signals that a chunk of code is about to begin. You write your code 
as needed and then end the chunk with a new line containing just 
three backticks. 
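
As a rough illustration of the format just described, the sketch below writes a very small RMarkdown file from R and knits it when the tools to do so are present. It assumes the rmarkdown package and pandoc for the knitting step and skips that step otherwise. The names fence, rmd_lines, and rmd_file are made up for this example, and the file is a temporary one rather than part of any project.

```r
# Build a tiny RMarkdown document as a character vector, one element per line.
# The chunk delimiter is assembled with strrep() so that this example can
# itself sit inside a code chunk without ending it early.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"notes\"",
  "output: html_document",
  "---",
  "",
  "## A first note",
  "",
  "Some plain-text commentary about the code below.",
  "",
  paste0(fence, "{r}"),   # three backticks, then the language in braces
  "summary(cars)",        # the code inside the chunk
  fence                   # a line with just three backticks ends the chunk
)

# Save it as a plain-text file (a temporary one, for illustration).
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Knit it, if the rmarkdown package and pandoc happen to be available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)
}
```

In everyday use you would write the .Rmd file by hand in an editor and knit it from there. Generating it from code is only a way to show the whole file in one place.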

If you keep your notes in this way, you will be able to see the 
code you wrote, the output it produces, and your own commentary 
or clarification on it in a convenient way. Moreover, you can turn 
it into a good-looking document right away. 

R version 3.4.1 (2017-06-30) -- "Single Candle" 
Copyright (C) 2017 The R Foundation for Statistical Computing 
Platform: x86_64-apple-darwin15.6.0 (64-bit) 

R is free software and comes with ABSOLUTELY NO WARRANTY. 
You are welcome to redistribute it under certain conditions. 
Type 'license()' or 'licence()' for distribution details. 

  Natural language support but running in an English locale 

R is a collaborative project with many contributors. 
Type 'contributors()' for more information and 
'citation()' on how to cite R or R packages in publications. 

Type 'demo()' for some demos, 'help()' for on-line help, or 
'help.start()' for an HTML browser interface to help. 
Type 'q()' to quit R. 

Figure 2.2: Bare-bones R running from the Terminal. 


2.2 Use R with RStudio 
The RStudio environment 

R itself is a relatively small application with next to no user 
interface. Everything works through a command line, or console. At 
its most basic, you launch it from your Terminal application (on 
a Mac) or Command Prompt (on Windows) by typing R. Once 
launched, R awaits your instructions at a command line of its own, 
denoted by the right angle bracket symbol, > (fig. 2.2). When you 
type an instruction and hit return, R interprets it and sends any 
resulting output back to the console. 

In addition to interacting with the console, you can also write 
your code in a text file and send that to R all at once. You can use 
any good text editor to write your .r scripts. But although a plain 
text file and a command line are the absolute minimum you need 
to work with R, it is a rather spartan arrangement. We can make 
life easier for ourselves by using RStudio. RStudio is an “integrated 
development environment,” or IDE. It is a separate application 
from R proper. When launched, it starts up an instance of R’s 
console inside of itself. It also conveniently pulls together various other 
elements to help you get your work done. These include the 
document where you are writing your code, the output it produces, 
and R’s help system. RStudio also knows about RMarkdown and 
understands a lot about the R language and the organization of 
your project. When you launch RStudio, it should look much like 
figure 2.3. 

Figure 2.3: The RStudio IDE. In the screenshot, commands are typed at the 
Console and results appear in the Plots tab. 


Create a project 

To begin, create a project. From the menu bar, choose File > New 
Project..., choose the New Directory option, 
and create the project. Once it is set up, create an RMarkdown 
file in the directory, with File > New File > RMarkdown. This 
will give you a set of choices including the default “Document.” 
The socviz library comes with a small RMarkdown template 
that follows the structure of this book. To use it instead of the 
default document, after selecting File > New File > RMarkdown, 
choose the “From Template” option in the sidebar of the dialog 
box that appears. Then choose “Data Visualization Notes” from 
the resulting list of options. When the RMarkdown document 
appears, save it right away in your project folder, with File > Save. 
The socviz template contains information about how RMarkdown 
works, together with some headers to get you started. Read 
what it has to say. Look at the code chunks and RMarkdown 
formatting. Experiment with knitting the document, and compare 
the output to the content of the plain text document. Figure 2.4 
shows you what an RMarkdown file looks like when opened in 
RStudio. 

You can create your new project wherever you like; most commonly 
it will go somewhere in your Documents folder. 

Figure 2.4: An RMarkdown file open in RStudio. The 
small icons in the top right-hand corner of each 
code chunk can be used to set options (the gear 
icon), run all chunks up to the current one (the 
downward-facing triangle), and just run the current 
chunk (the right-facing triangle). Labels in the figure mark the 
button that processes the whole document, the header with information 
about the document, the code chunks, and the notes and discussion 
with formatting instructions. 

RMarkdown is not required for R. An alternative is to use an R 
script, which contains R commands only. R script files conventionally 
have the extension .r or .R. (RMarkdown files conventionally 
end in .Rmd.) A brief project might just need a single .r file. But 
RMarkdown is useful for documents, notes, or reports of any 
length, especially when you need to take notes. If you do use an .r 
file you can leave comments or notes to yourself by starting a line 
with the hash character, #. You can also add comments at the end 
of lines in this way, as for any particular line R will ignore whatever 
code or text appears after a #. 
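
The two comment styles can be seen in a few lines of script. This is only an illustration; the object name heights is invented for the example.

```r
# A whole-line comment: R ignores everything on this line.
heights <- c(2, 4, 6)   # a trailing comment: R runs the code, then ignores the rest

# Commented-out code is not run either:
# heights <- c(100, 200, 300)

mean(heights)
```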

RStudio has various keyboard and menu shortcuts to help you 
edit code and text quickly. For example, you can insert chunks 
of code in your RMarkdown document with a keyboard shortcut. 
This saves you from writing the backticks and braces every 
time. You can run the current line of code with a shortcut, too. A 
third shortcut gives you a pop-over display with a summary of many 
other useful keystroke combinations. RMarkdown documents can 
include all kinds of other options and formatting paraphernalia, 
from text formatting to cross-references to bibliographical 
information. But never mind about those for now. 

You can create an R script via File > New File > R Script. 

To insert a code chunk: Command+Option+I on MacOS. Ctrl+Alt+I on 
Windows. 

To run the current line: Command+Enter on MacOS. Alt+Enter on Windows. 

To show the keyboard shortcuts: Option+Shift+K on MacOS. Alt+Shift+K on 
Windows. 

To make sure you are ready to go, load the tidyverse. The 
tidyverse is a suite of related packages for R developed by Hadley 
Wickham and others. The ggplot2 package is one of its 
components. The other pieces make it easier to get data into R and 
manipulate it once it is there. Either knit the notes file you 
created from the socviz template or load the packages manually at the 
console: 


library(tidyverse) 

library(socviz) 


Load the socviz package after the tidyverse. This library 
contains datasets that we will use throughout the book, along with 
some other tools that will make life easier. If you get an error 
message saying either package can’t be found, then reread the “Before 
You Begin” section in the preface to this book and follow the 
instructions there. 

You need to install a package only once, but you will need to 
load it at the beginning of each R session with library() if you 
want to use the tools it contains. In practice this means that the 
very first lines of your working file should contain a code chunk 
that loads the packages you will need in the file. If you forget to do 
this, then R will be unable to find the functions you want to use 
later on. 
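
One way to picture the difference between installing and loading is the sketch below. It checks whether the two packages named above are installed and attaches them only if they are. The install.packages() line is left commented out and assumes the packages can be installed from CRAN; the preface's instructions take precedence if they differ. The names pkgs and installed are illustrative.

```r
# Packages used in this book, as named in the text above.
pkgs <- c("tidyverse", "socviz")

# requireNamespace() reports whether a package is installed,
# without attaching it to the session.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Installing is a one-time step, so it is left commented out here:
# install.packages(pkgs[!installed])

# Loading has to happen in every new session before the functions are used.
if (all(installed)) {
  library(tidyverse)
  library(socviz)
}
```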


2.3 Things to Know about R 

Any new piece of software takes a bit of getting used to. This is 
especially true when using an IDE in a language like R. You are 
getting oriented to the language itself (what happens at the 
console) while learning to take notes in what might seem like an odd 
format (chunks of code interspersed with plain-text comments), 
in an IDE that has many features designed to make your life easier 
in the long run but which can be hard to decipher at the 
beginning. Here are some general points to bear in mind about how R 
is designed. They might help you get a feel for how the language 
works. 

Everything has a name 

In R, everything you deal with has a name. You refer to things 
by their names as you examine, use, or modify them. Named 
entities include variables (like x or y), data that you have loaded 
(like my_data), and functions that you use. (More about functions 
momentarily.) You will spend a lot of time talking about, creating, 
referring to, and modifying things with names. 

Some names are forbidden. These include reserved words like 
FALSE and TRUE, core programming words like Inf, for, else, 
break, and function, and words for special entities like NA and 
NaN. (These last two are codes designating missing data and “Not 
a Number,” respectively.) You probably won’t use these names by 
accident, but it’s good to know that they are not allowed. 

Some names you should not use even if they are technically 
permitted. These are mostly words that are already in use 
for objects or functions that form part of the core of R. These 
include the names of basic functions like q() or c(), common 
statistical functions like mean(), range(), or var(), and built-in 
mathematical constants like pi. 

Names in R are case sensitive. The object my_data is not the 
same as the object My_Data. When choosing names for things, 
be concise, consistent, and informative. Follow the style of the 
tidyverse and name things in lowercase, separating words with 
the underscore character, _, as needed. Do not use spaces when 
naming things, including variables in your data. 
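
A short sketch of these naming rules follows. It assumes a fresh session in which nothing called My_Data has been created, and the object names are illustrative.

```r
# Lowercase words separated by underscores, following tidyverse style.
my_data <- c(10, 20, 30)

# Names are case sensitive: my_data exists, My_Data does not.
exists("my_data")
exists("My_Data")

# Reserved words cannot be used as names. Trying to assign to TRUE fails,
# so the attempt is wrapped in try() to keep the script running.
result <- try(TRUE <- 5, silent = TRUE)
inherits(result, "try-error")
```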

Everything is an object 

Some objects are built in to R, some are added via packages, and 
some are created by the user. But almost everything is some kind of 
object. The code you write will create, manipulate, and use named 
objects as a matter of course. We can start immediately. Let’s 
create a vector of numbers. The command c() is a function. It’s short 
for “combine” or “concatenate.” It will take a sequence of 
comma-separated things inside the parentheses and join them together 
into a vector where each element is still individually accessible. 


c(1, 2, 3, 1, 3, 5, 25) 


## [1] 1 2 3 1 3 5 25 

Instead of sending the result to the console, we can 
assign it to an object we create: 

my_numbers <- c(1, 2, 3, 1, 3, 5, 25) 
your_numbers <- c(5, 31, 71, 1, 3, 21, 6) 

You can type the arrow using < and then -. 

To see what you made, type the name of the object and hit 
return: 


my_numbers 

## [1] 1 2 3 1 3 5 25 


If you learn only one keyboard shortcut in RStudio, 
make it this one! Always use Option+minus on 
MacOS or Alt+minus on Windows to type the 
assignment operator. 


Each of our numbers is still there and can be accessed directly if 
we want. They are now just part of a new object, a vector, called 
my_numbers. 

You create objects by assigning them to names. The assignment 
operator is <-. Think of assignment as the verb “gets,” reading 
left to right. So the bit of code above can be read as “The object 
my_numbers gets the result of concatenating the following numbers: 
1, 2, ...” The operator is two separate keys on your keyboard: 
the < key and the - (minus) key. Because you type this so often in R, 
there is a shortcut for it in RStudio. To write the assignment operator 
in one step, hold down the option key and hit -. On Windows 
hold down the alt key and hit -. You will be constantly creating 
objects in this way, and trying to type the two characters separately 
is both tedious and prone to error. You will make hard-to-notice 
mistakes like typing < - (with a space in between the characters) 
instead of <-. 
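
To see why a stray space in the operator is hard to notice, consider this small sketch. The object name x is illustrative. With the space, R reads the line as a comparison ("is x less than minus five?") instead of an assignment, so it raises no error and quietly leaves x unchanged.

```r
# Correct assignment: x "gets" 10. Nothing is printed.
x <- 10

# With a space, this is a comparison of x with -5, not an assignment.
comparison <- (x < - 5)
comparison

# x was never reassigned.
x
```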

When you create objects by assigning things to names, they 
come into existence in R’s workspace or environment. You can 
think of this most straightforwardly as your project directory. Your 
workspace is specific to your current project. It is the folder from 
which you launched R. Unless you have particular needs (such 
as extremely large datasets or analytical tasks that take a very 
long time) you will not need to give any thought to where objects 
“really” live. Just think of your code and data files as the permanent 
features of your project. When you start up an R project, you will 
generally begin by loading your data. That is, you will read it in 
from disk and assign it to a named object like my_data. The rest of 
your code will be a series of instructions to act on and create more 
named objects. 

You do things using functions 


You do almost everything in R using functions. Think of a function 
as a special kind of object that can perform actions for you. It 
produces output based on the input it receives. When we want a 
function to do something for us, we call it. It will reliably do what we 
tell it. We give the function some information, it acts on that 
information, and some results come out the other side. A schematic 
example is shown in fig. 2.5. Functions can be recognized by the 
parentheses at the end of their names. This distinguishes them 
from other objects, such as single numbers, named vectors, and 
tables of data. 

fn_name(argument1 = <value1>, 
        argument2 = <value2>, 
        argument3 = <value3>) 

plot_it(xvals = my_numbers, 
        yvals = your_numbers, 
        title = "Our Number Plot") 

Figure 2.5: Upper: What functions look like, 
schematically. Lower: an imaginary function that 
takes two vectors and plots them with a title. We 
supply the function with the particular vectors we 
want it to use, and the title. The vectors are objects 
so are given as is. The title is not an object so we 
enclose it in quotes. 

The parentheses are what allow you to send information to 
the function. Most functions accept one or more named 
arguments. A function’s arguments are the things it needs to know 
in order to do something. They can be some bit of your data 
(data = my_numbers), or specific instructions (title = "GDP 
per Capita"), or an option you want to choose (smoothing = 
"splines", show = FALSE). For example, the object my_numbers 
is a numeric vector: 

my_numbers 

## [1] 1 2 3 1 3 5 25 

But the thing we used to create it, c(), is a function. It 
concatenates items into a vector composed of the series of 
comma-separated elements you give it. Similarly, mean() is a function 
that calculates a simple average for a vector of numbers. What 
happens if we just type mean() without any arguments inside the 
parentheses? 


mean() 

## Error in mean.default() : argument 'x' is missing, with no default 


The error message is terse but informative. The function 
needs an argument to work, and we haven’t given it one. In this 
case it needs x, an object that mean() can perform its 
calculation on: 

mean(x = my_numbers) 


## [1] 5.714286 


mean(x = your_numbers) 

## [1] 19.71429 

While the function arguments have names that are used internally 
(here, x), you don’t strictly need to specify the name for the 
function to work: 


mean(my_numbers) 


## [1] 5.714286 


See the appendix for a guide to how to read the 
help page for a function. 


If you omit the name of the argument, R will just assume 
you are giving the function what it needs, and in a default order. 
The documentation for a function will tell you what the order of 
required arguments is for any particular function. For simple 
functions that require only one or two arguments, omitting their names 
is usually not confusing. For more complex functions, you will 
typically want to use the names of the arguments rather than try to 
remember what the ordering is. 
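
The trade-off between named and unnamed arguments can be sketched with base R's seq() function, chosen here only because it takes several arguments (from, to, and by). The object names are illustrative.

```r
# Named arguments: the intent is clear, and the order does not matter.
by_name   <- seq(from = 1, to = 10, by = 3)
reordered <- seq(by = 3, to = 10, from = 1)

# Unnamed arguments: R matches them by position (from, then to, then by).
by_position <- seq(1, 10, 3)

by_name
reordered
by_position
```

All three calls ask for the same sequence, but only the named versions can be read without remembering the argument order.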

In general, when providing arguments to a function the syntax 
is <argument> = <value>. If <value> is a named object that 
already exists in your workspace, like a vector of numbers or a table 
of data, then you provide it unquoted, as in mean(my_numbers). 
If <value> is not an object, a number, or a logical value like 
TRUE, then you usually put it in quotes, e.g., labels(x = "X Axis 
Label"). 

Functions take inputs via their arguments, do something, and 
return outputs. What the output is depends on what the 
function does. The c() function takes a sequence of comma-separated 
elements and returns a vector consisting of those same elements. 
The mean() function takes a vector of numbers and returns a single 
number, their average. Functions can return far more than single 
numbers. The output returned by functions can be a table of data, 
or a complex object such as the results of a linear model, or the 
instructions needed to draw a plot on the screen (as we shall see). 
For example, the summary() function performs a series of 
calculations on a vector and produces what is in effect a little table with 
named elements. 

A function’s argument names are internal to that function. 
Say you have created an object in your environment named x, for 
example. A function like mean() also has a named argument, x, 
but R will not get confused by this. It will not use your x object by 
mistake. 
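
A quick check of this point, using the my_numbers vector from earlier and an unrelated object that happens to be called x:

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# An object in our environment that shares its name with mean()'s argument.
x <- c(100, 200, 300)

# Here the argument x is given my_numbers, so our own x is not involved.
mean_of_my_numbers <- mean(x = my_numbers)
mean_of_my_numbers

# Our x is only used when we pass it ourselves.
mean_of_x <- mean(x)
mean_of_x
```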

As we have already seen with c() and mean(), you can assign 
the result of a function to an object: 


my_summary <- summary(my_numbers) 


When you do this, there’s no output to the console. R just puts 
the results into the new object, as you instructed. To look inside 
the object you can type its name and hit return: 

my_summary 

## Min. 1st Qu. Median Mean 3rd Qu. Max. 

## 1.00 1.50 3.00 5.71 4.00 25.00 

Functions come in packages 

The code you write will be more or less complex depending on the 
task you want to accomplish. Once you have gotten used to 
working in R, you will probably end up writing your own functions to 
produce the results that you need. But as with other programming 
languages, you will not have to do everything yourself. Families 
of useful functions are bundled into packages that you can install, 
load into your R session, and make use of as you work. Packages 
save you from reinventing the wheel. They make it so that you do 
not, for example, have to figure out how to write code from scratch 
to draw a shape on screen, or load a data file into memory. 
Packages are also what allow you to build on the efforts of others in 
order to do your own work. Ggplot is a library of functions. There 
are many other such packages, and we will make use of several 
throughout this book, by either loading them with the library() 
function or “reaching in” to them and pulling a useful function 
from them directly. Writing code and functions of your own is a 
good way to get a sense of the amazing volume of effort put into R 
and its associated toolkits, work freely contributed by many hands 
over the years and available for anyone to use. 
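
The two ways of getting at a package's functions can be sketched with packages that ship with R itself, so nothing extra needs installing. The choice of the stats and tools packages is just for illustration; "reaching in" is done with the :: operator.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# Option 1: load the whole package, then call its functions by name.
# (The stats package is attached by default; library() is shown for form.)
library(stats)
middle <- median(my_numbers)
middle

# Option 2: reach into a package for a single function with ::,
# without loading the rest of it.
ext <- tools::file_ext("notes.Rmd")
ext
```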

All the visualization we will do involves choosing the right 
function or functions and then giving those functions the right 
instructions through a series of named arguments. Most of the 
mistakes we will make, and the errors we will fix, will involve us 
having not picked the right function, or having not fed the 
function the right arguments, or having failed to provide information 
in a form the function can understand. 

For now, just remember that you do things in R by creating and 
manipulating named objects. You manipulate objects by feeding 
information about them to functions. The functions do something 
useful with that information (calculate a mean, recode a variable, 
fit a model) and give you the results back. 


table(my_numbers) 

## my_numbers 
##  1  2  3  5 25 
##  2  1  2  1  1 

sd(my_numbers) 

## [1] 8.6 

my_numbers * 5 

## [1]   5  10  15   5  15  25 125 

my_numbers + 1 

## [1]  2  3  4  2  4  6 26 

my_numbers + my_numbers 

## [1]  2  4  6  2  6 10 50 

The first two functions here gave us a simple table of counts 
and calculated the standard deviation of my_numbers. It’s worth 
noticing what R did in the last three cases. First we multiplied 
my_numbers by five. R interprets that as you asking it to take each 
element of my_numbers one at a time and multiply it by five. It does 
the same with the instruction my_numbers + 1. The single value 
is “recycled” down the length of the vector. By contrast, in the last 
case we add my_numbers to itself. Because the two objects being 
added are the same length, R adds each element in the first 
vector to the corresponding element in the second vector. This is an 
example of a vectorized operation. 
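
Recycling is not limited to single values. As a small sketch going slightly beyond the cases above, a shorter vector is reused as many times as needed when its length divides evenly into that of the longer one. The names long_vec and short_vec are illustrative.

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is recycled: 1 + 10, 2 + 20, 3 + 10, 4 + 20
recycled_sum <- long_vec + short_vec
recycled_sum

# A single value is the simplest case of the same rule.
doubled <- long_vec * 2
doubled
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign that something is wrong with the data.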


If you're not sure what an object is, ask for its class 

Every object has a class. This is the sort of object it is, whether a 
vector, a character string, a function, a list, and so on. Knowing an 
object’s class tells you a lot about what you can and can’t do with it. 


class(my_numbers) 


## [1] "numeric" 


class(my_summary) 

## [1] "summaryDefault" "table" 


class(summary) 

## [1] "function" 

Certain actions you take may change an object’s class. For 
instance, consider my_numbers again: 

my_new_vector <- c(my_numbers, "Apple") 
my_new_vector 

## [1] "1"     "2"     "3"     "1"     "3"     "5"     "25" 
## [8] "Apple" 

class(my_new_vector) 

## [1] "character" 


The function added the word “Apple” to our vector of numbers, 
as we asked. But in doing so, the result is that the new 
object also has a new class, switching from “numeric” to “character.” 
All the numbers are now enclosed in quotes. They have been 
turned into character strings. In that form, they can’t be used in 
calculations. 

The appendix has a little more discussion of the 
basics of selecting the elements within objects. 
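
The change of class described above can be checked, and partly undone, with base R's as.numeric() function. This is a sketch rather than a recommended workflow. Anything that cannot be read as a number becomes NA, and the warning R would normally give about that is suppressed here on purpose.

```r
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_new_vector <- c(my_numbers, "Apple")

# Everything has been coerced to character strings.
class(my_new_vector)

# Convert back. "Apple" has no numeric meaning, so it becomes NA.
converted <- suppressWarnings(as.numeric(my_new_vector))
converted
class(converted)

# Calculations work again once the missing value is set aside.
mean(converted, na.rm = TRUE)
```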

Most of the work we’ll be doing will not involve directly picking 
out this or that value from vectors or other entities. Instead we 
will try to work at a slightly higher level that will be easier and safer. 
But it’s worth knowing the basics of how elements of vectors can 
be referred to because the c() function in particular is a useful tool. 

We will spend a lot of time in this book working with a series 
of datasets. These typically start life as files stored locally on your 
computer or somewhere remotely accessible to you. Once they are 
imported into R, then like everything else they exist as objects of 
some kind. R has several classes of objects used to store data. A 
basic one is a matrix, which consists of rows and columns of numbers. 
But the most common kind of data object in R is a data frame, 
which you can think of as a rectangular table consisting of rows 
(of observations) and columns (of variables). In a data frame the 
columns can be of different classes. Some may be character strings, 
some numeric, and so on. For instance, here is a very small dataset 
from the socviz library: 

titanic

##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6

class(titanic)

## [1] "data.frame"

In this titanic data, two of the columns are numeric and two 
are not. You can access the rows and columns in various ways. For 
example, the $ operator allows you to pick out a named column of 
a data frame: 


titanic$percent 

## [1] 62.0 5.7 16.7 15.6 

Appendix 1 contains more information about selecting particular 
elements from different kinds of objects. 
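As a rough sketch of a few of these “various ways,” the code below rebuilds the small table by hand (so it does not depend on the socviz library) under the assumed name titanic_small, and then picks out columns and rows with base R’s $, [[ ]], and [ , ] operators:

```r
# A hand-built copy of the small table above; the name
# titanic_small is an assumption made for this example
titanic_small <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

titanic_small$percent      # a column, picked out by name
titanic_small[["n"]]       # the same idea, using double brackets
titanic_small[2, ]         # the second row, with all its columns

# The n column, but only for rows where sex is "female"
titanic_small[titanic_small$sex == "female", "n"]
```

The general pattern inside single square brackets is [rows, columns]. Leaving one side empty means “all of them.”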




We will also regularly encounter a slightly augmented version
of a data frame called a tibble. The tidyverse libraries make extensive
use of tibbles. Like data frames, they are used to store variables
of different classes all together in a single table of data. They also
do a little more to let us know about what they contain and are a
little more friendly when interacted with from the console. We can
convert a data frame to a tibble if we want:


titanic_tb <- as_tibble(titanic)
titanic_tb

## # A tibble: 4 x 4
##   fate     sex        n percent
##   <fct>    <fct>  <dbl>   <dbl>
## 1 perished male   1364.   62.0
## 2 perished female  126.    5.70
## 3 survived male    367.   16.7
## 4 survived female  344.   15.6

Look carefully at the top and bottom of the output to see what 
additional information the tibble class gives you over and above 
the data frame version. 

To see inside an object, ask for its structure 

The str() function is sometimes useful. It lets you see what is 
inside an object. 


str(my_numbers)

## num [1:7] 1 2 3 1 3 5 25


str(my_summary)

## Classes 'summaryDefault', 'table'  Named num [1:6] 1 1.5 3 5.71 4 ...
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...

Fair warning: while some objects are relatively simple (a vector
is just a sequence of numbers), others are more complicated,
so asking about their str() might output a forbidding amount
of information to the console. In general, complex objects are
organized collections of simpler objects, often assembled as a big
list, sometimes with a nested structure. Think, for example, of a
master to-do list for a complex activity like moving house. It might
be organized into subtasks of different kinds, several of which
would themselves have lists of individual items. One list of tasks
might be related to scheduling the moving truck, another might
cover things to be donated, and a third might be related to setting
up utilities at the new house. In a similar way, the objects we create
to make plots will have many parts and subparts, as the overall task
of drawing a plot has many individual to-do items. But we will be
able to build these objects up from simple forms through a series
of well-defined steps. And unlike moving house, the computer will
take care of actually carrying out the task for us. We just need to
get the to-do list right.

RStudio also has the ability to summarize and
look inside objects, via its Environment tab.
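To make the to-do list analogy concrete, here is a small sketch of a nested list in base R. Every name and value in it (moving_house, the company, the date) is invented for illustration. Calling str() on it shows how the pieces nest inside one another:

```r
# A made-up to-do list for moving house: a list whose
# elements are themselves lists or vectors
moving_house <- list(
  truck     = list(company = "Hypothetical Movers", date = "2024-06-01"),
  donate    = c("old sofa", "bookshelf"),
  utilities = list(electricity = "call provider",
                   internet    = "book installation")
)

str(moving_house)              # prints the nested structure
moving_house$truck$date        # drill down one level at a time with $
length(moving_house$donate)    # 2
```

The objects ggplot creates are organized in broadly the same spirit, only with many more parts.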


2.4 Be Patient with R, and with Yourself 

Like all programming languages, R does exactly what you tell it to,
rather than exactly what you want it to. This can make it frustrating
to work with. It is as if one had an endlessly energetic, powerful,
but also extremely literal-minded robot to order around. Remember
that no one writes fluent, error-free code on the first go all the
time. From simple typos to big misunderstandings, mistakes are a
standard part of the activity of programming. This is why error
checking, debugging, and testing are also a central part of programming.
So just try to be patient with yourself and with R while
you use it. Expect to make errors, and don’t worry when that happens.
You won’t break anything. Each time you figure out why a bit
of code has gone wrong, you will have learned a new thing about
how the language works.

Here are three specific things to watch out for: 

• Make sure parentheses are balanced and that every opening
“(” has a corresponding closing “)”.

• Make sure you complete your expressions. If you think you
have completed typing your code but instead of seeing the >
command prompt at the console you see the + character, that
may mean R thinks you haven’t written a complete expression
yet. You can hit Esc or Ctrl-C to force your way back to the
console and try typing your code again.

• In ggplot specifically, as you will see, we will build up plots a
piece at a time by adding expressions to one another. When
doing this, make sure your + character goes at the end of the
line and not the beginning. That is, write this:


ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point() 


and not this: 


ggplot(data = mpg, aes(x = displ, y = hwy))
+ geom_point()
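The reason is that R treats a line as finished as soon as it forms a complete expression. A sketch with plain arithmetic (the names a and b are arbitrary) shows the same behavior without needing ggplot:

```r
# The + at the end of the line tells R the expression continues
a <- 1 +
  2
a    # 3

# Here the first line is already a complete expression, so b is 1.
# The second line is a separate expression: unary plus applied to 2.
b <- 1
+ 2
b    # 1
```

With numbers the stray second line goes unnoticed. With ggplot, a line that begins with + typically ends in an error instead, because there is no plot on that line for the layer to be added to.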


RStudio will do its best to help you with the task of writing
your code. It will highlight your code by syntax; it will try to match
characters (like parentheses) that need to be balanced; it will try to
narrow down the source of errors in code that fails to run; it will try
to auto-complete the names of objects you type so that you make
fewer typos; it will make help files more easily accessible and the
arguments of functions directly available. Go slowly and see how
the software is trying to help you out.


2.5 Get Data into R 

Before we can plot anything at all, we have to get our data into R
in a format it can use. Cleaning and reading in your data is one
of the least immediately satisfying pieces of an analysis, whether
you use R, Stata, SAS, SPSS, or any other statistical software. This
is the reason that many of the datasets for this book are provided
in a preprepared form via the socviz library rather than as data
files you must manually read in. However, it is something you will
have to face sooner rather than later if you want to use the skills
you learn in this book. We might as well see how to do it now. Even
when learning R, it can be useful and motivating to try out the code
on your own data rather than working with the sample datasets.

These are all commercial software applications for
statistical analysis. Stata, in particular, is in wide use
across the social sciences.

Use the read_csv() function to read in comma-separated
data. This function is in the readr package, one of the pieces of
the tidyverse. R and the tidyverse also have functions to import
various Stata, SAS, and SPSS formats directly. These can be found
in the haven package. All we need to do is point read_csv() at a
file. This can be a local file, e.g., in a subdirectory called data/,
or it can be a remote file. If read_csv() is given a URL or ftp
address, it will follow it automatically. In this example, we have
a CSV file called organdonation.csv stored at a trusted remote
location. While online, we assign the URL for the file to an object,
for convenience, and then tell read_csv() to fetch it for us and put
it in an object named organs.

url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv" 
organs <- read_csv(file = url) 

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   year = col_integer(),
##   donors = col_double(),
##   pop = col_integer(),
##   pop.dens = col_double(),
##   gdp = col_integer(),
##   gdp.lag = col_integer(),
##   health = col_double(),
##   health.lag = col_double(),
##   pubhealth = col_double(),
##   roads = col_double(),
##   cerebvas = col_integer(),
##   assault = col_integer(),
##   external = col_integer(),
##   txp.pop = col_double()
## )
## See spec(...) for full column specifications.

The resulting message at the console tells us the read_csv()
function has assigned a class to each column of the object it created
from the CSV file. There are columns with integer values, some
are character strings, and so on. (The double class is for numbers
other than integers.) Part of the reason read_csv() is telling
you this information is that it is helpful to know what class each
column, or variable, is. A variable’s class determines what sort of
operations can be performed on it. You also see this information
because the tidyverse’s read_csv() (with an underscore character
in the middle of its name) is more opinionated than an older, and
also still widely used, function, read.csv() (with a period in the
middle of its name). The newer read_csv() will not classify variables
as factors unless you tell it to. This is in contrast to the older
function, which in versions of R before 4.0.0 treated any vector of
characters as a factor unless told otherwise. Factors have some very
useful features in R (especially when it comes to representing
various kinds of treatment and control groups in experiments), but
they often trip up users who are not fully aware of them. Thus
read_csv() avoids them unless you explicitly say otherwise.
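The difference between a factor and a plain character column is easy to see with a small sketch. It uses base R’s read.csv() on a few lines of invented CSV text, and it sets the stringsAsFactors argument explicitly both times so that the result does not depend on which version of R is installed:

```r
# A tiny CSV supplied as text, so no file is needed.
# The values are invented for illustration.
csv_text <- "country,donors\nAustralia,10.5\nSpain,33.1"

# Ask explicitly for factors: the character column becomes a factor
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(with_factors$country)      # "factor"

# Ask explicitly for plain strings: the column stays character
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_strings$country)        # "character"
```

The donors column is numeric in both cases. Only the treatment of text changes.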

R can read in data files in many different formats. The haven
package provides functions to read files created in a variety of commercial
software packages. If your dataset is a Stata .dta file, for
instance, you can use the read_dta() function in much the same
way as we used read_csv() above. This function can read and
write variables stored as logical values, integers, numbers, characters,
and factors. Stata also has a labeled data class that the haven
library partially supports. In general you will end up converting
labeled variables to one of R’s basic classes. Stata also supports an
extensive coding scheme for missing data. This is generally not
used directly in R, where missing data is coded simply as NA. Again,
you will need to take care that any labeled variables imported into
R are coded properly, so that you do not end up mistakenly using
missing data in your analysis.
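As a sketch of what can go wrong, suppose a file used the number -99 to mean “missing.” That code and the values below are invented for illustration and are not taken from any dataset in the book. Unless the code is converted to NA, it is silently treated as a real measurement:

```r
# Invented values; -99 stands in for whatever missing-data
# code the original file happened to use
health_raw <- c(3.2, -99, 4.1, 5.0, -99)

mean(health_raw)               # misleading: the -99 codes count as data

health <- health_raw
health[health == -99] <- NA    # recode the sentinel value to R's NA
sum(is.na(health))             # 2
mean(health, na.rm = TRUE)     # mean of the three observed values
```

Note that mean(health) without na.rm = TRUE returns NA, which is R’s way of making sure missing values are not ignored by accident.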

When preparing your data for use in R, and in particular for 
graphing with ggplot, bear in mind that it is best if it is represented 
in a “tidy” format. Essentially this means that your data should 
be in long rather than wide format, with every observation a row 
and every variable a column. We will discuss this in more detail in 
chapter 3, and you can also consult the discussion of tidy data in 
the appendix. 


2.6 Make Your First Figure 

That’s enough ground clearing for now. Writing code can be frustrating,
but it also allows you to do interesting things quickly. Since
the goal of this book is not to teach you all about R but just how to
produce good graphics, we can postpone a lot of details until later
(or indeed ignore them indefinitely). We will start as we mean to go
on, by using a function to make a named object, and plot the result.


R can also talk directly to databases, a topic not 
covered here. 


See haven's documentation for more details.

We will use the Gapminder dataset, which you should already have
available on your computer. We load the data with library() and
take a look.

library(gapminder) 

gapminder 


## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

This is a table of data about a large number of countries, each 
observed over several years. Let’s make a scatterplot with it. Type 
the code below and try to get a sense of what’s happening. Don’t 
worry too much yet about the details. 


p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()


Not a bad start. Our graph (fig. 2.6) is fairly legible, it has its 
axes informatively labeled, and it shows some sort of relationship 
between the two variables we have chosen. It could also be made 
better. Let’s learn more about how to improve it. 


2.7 Where to Go Next 

Figure 2.6: Life expectancy plotted against GDP per
capita for a large number of country-years.


You should go straight to the next chapter. However, you could
also spend a little more time getting familiar with R and RStudio.
Some information in the appendix to this book might already be
worth glancing at, especially the additional introductory material
on R, and the discussion there about some common problems that
tend to happen when reading in your own data. There are several
free or initially free online introductions to the R language
that are worth trying. You do not need to know the material they
cover in order to keep going with this book, but you might find
one or more of them useful. If you get a little bogged down in any
of them or find the examples they choose are not that relevant to
you, don’t worry. These introductions tend to want to introduce
you to a range of programming concepts and tools that we will not
need right away.

It is also worth familiarizing yourself a little with how RStudio
works and with what it can do for you. The RStudio website
has a great deal of introductory material to help you along. You
can also find a number of handy cheat sheets there that summarize
different pieces of RStudio, RMarkdown, and various tidyverse
packages that we will use throughout the book. These cheat sheets
are not meant to teach you the material, but they are helpful points
of reference once you are up and running.


swirlstats.com 

tryr.codeschool.com 

datacamp.com 


rstudio.com 


rstudio.com/resources/cheatsheets 



3 Make a Plot 


This chapter will teach you how to use ggplot’s core functions to 
produce a series of scatterplots. From one point of view, we will 
proceed slowly and carefully, taking our time to understand the 
logic behind the commands that you type. The reason for this is 
that the central activity of visualizing data with ggplot more or less 
always involves the same sequence of steps. So it is worth learning 
what they are. 

From another point of view, though, we will move fast. Once 
you have the basic sequence down and understand how it is that 
ggplot assembles the pieces of a plot into a final image, then 
you will find that analytically and aesthetically sophisticated plots 
come within your reach very quickly. By the end of this chapter, 
for example, we will have learned how to produce a small-multiple 
plot of time-series data for a large number of countries, with a 
smoothed regression line in each panel. 


3.1 How Ggplot Works 

As we saw in chapter 1, visualization involves representing your
data using lines or shapes or colors and so on. There is some structured
relationship, some mapping, between the variables in your
data and their representation in the plot displayed on your screen
or on the page. We also saw that not all mappings make sense for
all types of variables, and (independently of this) some representations
are harder to interpret than others. Ggplot provides you with
a set of tools to map data to visual elements on your plot, to specify
the kind of plot you want, and then to control the fine details
of how it will be displayed. Figure 3.1 shows a schematic outline of
the process starting from data, at the top, down to a finished plot
at the bottom. Don’t worry about the details for now. We will be
going into them one piece at a time over the next few chapters.

The most important thing to get used to with ggplot is the way
you use it to think about the logical structure of your plot. The
code you write specifies the connections between the variables in
your data, and the colors, points, and shapes you see on the screen.
In ggplot, these logical connections between your data and the plot
elements are called aesthetic mappings or just aesthetics. You begin
every plot by telling the ggplot() function what your data is and
how the variables in this data logically map onto the plot’s aesthetics.
Then you take the result and say what general sort of plot you
want, such as a scatterplot, a boxplot, or a bar chart. In ggplot,
the overall type of plot is called a geom. Each geom has a function
that creates it. For example, geom_point() makes scatterplots,
geom_bar() makes bar plots, geom_boxplot() makes boxplots,
and so on. You combine these two pieces, the ggplot() object and
the geom, by literally adding them together in an expression, using
the “+” symbol.

At this point, ggplot will have enough information to be able to
draw a plot for you. The rest is just details about exactly what you
want to see. If you don’t specify anything further, ggplot will use
a set of defaults that try to be sensible about what gets drawn. But
more often you will want to specify exactly what you want, including
information about the scales, the labels of legends and axes,
and other guides that help people to read the plot. These pieces
are added to the plot in the same way as the geom_ function was.
Each component has its own function, you provide arguments to
it specifying what to do, and you literally add it to the sequence of
instructions. In this way you systematically build your plot piece
by piece.
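A minimal sketch of this piece-by-piece construction follows. It assumes the ggplot2 package is installed and uses the mpg dataset that ships with it rather than the book’s own data. The intermediate object names and the label text are made up for the example:

```r
library(ggplot2)

# Start with the data and the aesthetic mappings
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Add a geom, then a scale, then labels, one component at a time
p_points <- p + geom_point()
p_scaled <- p_points + scale_x_log10()
p_final  <- p_scaled + labs(x = "Engine displacement (log scale)",
                            y = "Highway miles per gallon",
                            title = "Built up piece by piece")

p_final   # printing the object draws the plot
```

Each intermediate object is itself a complete plot that can be printed, which makes it easy to check the effect of every addition.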

In this chapter we will go through the main steps of this process.
We will proceed by example, repeatedly building a series of
plots. As noted earlier, I strongly encourage you to go through this
exercise manually, typing (rather than copying and pasting) the
code yourself. This may seem a bit tedious, but it is by far the
most effective way to get used to what is happening, and to get
a feel for R’s syntax. While you’ll inevitably make some errors, you
will also quickly find yourself becoming able to diagnose your own
mistakes, as well as having a better grasp of the higher-level structure
of plots. You should open the RMarkdown file for your notes,
remember to load the tidyverse package, and write the code out in
chunks, interspersing your own notes and comments as you go.


library(tidyverse)


Figure 3.1: The main elements of ggplot’s grammar
of graphics. This chapter goes through these steps
in detail. [Schematic: starting from a small tidy table with
columns gdp, lifexp, pop, and continent, the panels show
p <- ggplot(data = gapminder, mapping = aes(x = gdp, y = lifexp,
size = pop, color = continent)); then p + geom_point(); then
p + coord_cartesian() + scale_x_log10(); then
p + labs(x = "log GDP", y = "Life Expectancy",
title = "A Gapminder Plot").]


3.2 Tidy Data 

The tidyverse tools we will be using want to see your data in a particular
sort of shape, generally referred to as “tidy data” (Wickham
2014). Social scientists will likely be familiar with the distinction
between wide-format and long-format data. In a long-format table,
every variable is a column, and every observation is a row. In a
wide-format table, some variables are spread out across columns.
For example, table 3.1 shows part of a table of life expectancy over
time for a series of countries. This is in wide format because one
of the variables, year, is spread across the columns of the table.

By contrast, table 3.2 shows the beginning of the same data in
long format. The tidy data that ggplot wants is in this long form.
In a related bit of terminology, in this table the year variable is
sometimes called a key and the lifeExp variable is the value taken
by that key for any particular row. These terms are useful when
converting tables from wide to long format. I am speaking fairly
loosely here. Underneath these terms there is a worked-out theory
of the forms that tabular data can be stored in, but right now we
don’t need to know those additional details. For more on the ideas
behind tidy data, see the discussion in the appendix. You will also
find an example showing the R code you need to get from an untidy
to a tidy shape for the common “wide” case where some variables
are spread out across the columns of your table.

If you compare tables 3.1 and 3.2, it is clear that a tidy table 
does not present data in its most compact form. In fact, it is 
usually not how you would choose to present your data if you 
wanted just to show people the numbers. Neither is untidy data 
“messy” or the “wrong” way to lay out data in some generic sense. 
It’s just that, even if its long-form shape makes tables larger, tidy 
data is much more straightforward to work with when it comes 
to specifying the mappings that you need to coherently describe 
plots. 
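To give a sense of what the conversion looks like, here is a small sketch that types in the first few cells of table 3.1 by hand and reshapes them to the layout of table 3.2. It assumes the tidyr package is installed and uses its pivot_longer() function. The appendix’s own example may use a different function, and the object names life_wide and life_long are made up here:

```r
library(tidyr)

# The first two rows and three year columns of table 3.1
life_wide <- data.frame(
  country = c("Afghanistan", "Albania"),
  `1952`  = c(29, 55),
  `1957`  = c(30, 59),
  `1962`  = c(32, 65),
  check.names = FALSE    # keep the year column names as they are
)

# Gather the year columns into a key column (year)
# and a value column (lifeExp)
life_long <- pivot_longer(life_wide,
                          cols = -country,
                          names_to = "year",
                          values_to = "lifeExp")
life_long$year <- as.integer(life_long$year)

life_long   # six rows: one per country-year
```

Two rows and three year columns in the wide table become six rows in the long one, which is the sense in which tidy tables are less compact.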


3.3 Mappings Link Data to Things You See 

Table 3.1

Life Expectancy data in wide format.

country      1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007
Afghanistan    29    30    32    34    36    38    40    41    42    42    42    44
Albania        55    59    65    66    68    69    70    72    72    73    76    76
Algeria        43    46    48    51    55    58    61    66    68    69    71    72
Angola         30    32    34    36    38    39    40    40    41    41    41    43
Argentina      62    64    65    66    67    68    70    71    72    73    74    75
Australia      69    70    71    71    72    73    75    76    78    79    80    81


It’s useful to think of a recipe or template that we start from each
time we want to make a plot. This is shown in figure 3.2. We start
with just one object of our own, our data, which should be in a
shape that ggplot understands. Usually this will be a data frame or
some augmented version of it, like a tibble. We tell the core ggplot
function what our data is. In this book, we will do this by creating
an object named p, which will contain the core information for our
plot. (The name p is just a convenience.) Then we choose a plot
type, or geom, and add it to p. From there we add more features to
the plot as needed, such as additional elements, adjusted scales, a
title, or other labels.

We’ll use the gapminder data to make our first plots. Make
sure the library containing the data is loaded. If you are following
through from the previous chapter in the same RStudio session or
RMarkdown document, you won’t have to load it again. Otherwise,
use library() to make it available.

library(gapminder) 


Table 3.2

Life Expectancy data in long format.

country      year  lifeExp
Afghanistan  1952       29
Afghanistan  1957       30
Afghanistan  1962       32
Afghanistan  1967       34
Afghanistan  1972       36
Afghanistan  1977       38


We can remind ourselves again what it looks like by typing the 
name of the object at the console: 


gapminder 


## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows


p <- ggplot(data = <data>,
            mapping = aes(<aesthetic> = <variable>,
                          <aesthetic> = <variable>,
                          <...> = <...>))

p + geom_<type>(<...>) +
    scale_<mapping>_<type>(<...>) +
    coord_<type>(<...>) +
    labs(<...>)

Figure 3.2: A schematic for making a plot.


Remember, use Option+minus on MacOS or
Alt+minus on Windows to type the assignment
operator.

Let’s say we want to plot life expectancy against per capita GDP
for all country-years in the data. We’ll do this by creating an object
that has some of the necessary information in it and build it up
from there. First, we must tell the ggplot() function what data we
are using.


p <- ggplot(data = gapminder)


You do not need to explicitly name the arguments
you pass to functions, as long as you provide them
in the expected order, viz., the order listed on the
help page for the function. This code would still
work if we omitted data = and mapping =. In this
book, I name all the arguments for clarity.
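The same rule holds for any R function. As a small sketch, base R’s seq() function takes arguments called from, to, and by, in that order. The three calls below give the same result (the object names are arbitrary):

```r
# All arguments named, in the documented order
named <- seq(from = 1, to = 10, by = 3)

# The same call, relying on argument order alone
positional <- seq(1, 10, 3)

# With names, the order no longer matters
reordered <- seq(by = 3, to = 10, from = 1)

identical(named, positional)   # TRUE
identical(named, reordered)    # TRUE
```

Naming arguments costs a little typing but makes the code easier to read and protects against accidentally supplying values in the wrong order.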


At this point ggplot knows our data but not the mapping. That
is, we need to tell it which variables in the data should be represented
by which visual elements in the plot. It also doesn’t know
what sort of plot we want. In ggplot, mappings are specified using
the aes() function, like this:

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))


Here we’ve given the ggplot() function two arguments
instead of one: data and mapping. The data argument tells ggplot
where to find the variables it is about to use. This saves us from having
to tediously dig out the name of each variable in full. Instead,
any mentions of variables will be looked for here first.

Next, the mapping. The mapping argument is not a data object,
nor is it a character string. Instead, it’s a function. (Remember,
functions can be nested inside other functions.) The arguments we
give to the aes() function are a sequence of definitions that ggplot
will use later. Here they say, “The variable on the x-axis is going
to be gdpPercap, and the variable on the y-axis is going to be
lifeExp.” The aes() function does not say where variables with
those names are to be found. That’s because ggplot() is going to
assume that things with that name are in the object given to the
data argument.

The mapping = aes(...) argument links variables to things
you will see on the plot. The x and y values are the most obvious
ones. Other aesthetic mappings can include, for example, color,
shape, size, and line type (whether a line is solid, dashed, or some
other pattern). We’ll see examples in a minute. A mapping does
not directly say what particular colors or shapes, for example, will
be on the plot. Rather it says which variables in the data will be
represented by visual elements like a color, a shape, or a point on
the plot area.
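One way to convince yourself that a mapping is only a set of definitions is to create one on its own and look at it. The sketch below assumes ggplot2 is installed. Notice that it runs even though no dataset has been supplied, because aes() records the variable names without looking them up:

```r
library(ggplot2)

# aes() records which variables go with which aesthetics;
# it does not go looking for the variables yet
m <- aes(x = gdpPercap, y = lifeExp, color = continent)

m          # prints the recorded definitions
names(m)   # the aesthetics that have been mapped
```

The variables are only looked up later, in the data given to ggplot(), when the plot is actually drawn. Nothing in m says which colors will be used, only that color will depend on continent.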

What happens if we just type p at the console at this point and 
hit return? The result is shown in figure 3.3. 

p

The p object has been created by the ggplot() function and
already has information in it about the mappings we want, together
with a lot of other information added by default. (If you want to
see just how much information is in the p object already, try asking
for str(p).) However, we haven’t given it any instructions yet
about what sort of plot to draw. We need to add a layer to the plot.
This means picking a geom_ function. We will use geom_point().
It knows how to take x and y values and plot them in a scatterplot.


































Figure 3.3: This empty plot has no geoms.


p + geom_point()

Figure 3.4: A scatterplot of life expectancy vs GDP.


3.4 Build Your Plots Layer by Layer

Although we got a brief taste of ggplot at the end of chapter 2, we
spent more time in that chapter preparing the ground to make this
first proper graph. We set up our software IDE and made sure we
could reproduce our work. We then learned the basics of how R
works, and the sort of tidy data that ggplot expects. Just now we
went through the logic of ggplot’s main idea, of building up plots a
piece at a time in a systematic and predictable fashion, beginning
with a mapping between a variable and an aesthetic element. We
have done a lot of work and produced one plot.

The good news is that, from now on, not much will change
conceptually about what we are doing. It will be more a question
of learning in greater detail about how to tell ggplot what
to do. We will learn more about the different geoms (or types
of plot) available and find out about the functions that control
the coordinate system, scales, guiding elements (like labels and
tick-marks), and thematic features of plots. This will allow us to
make much more sophisticated plots surprisingly fast. Conceptually,
however, we will always be doing the same thing. We will start
with a table of data that has been tidied, and then we will do the
following:


1. Tell the ggplot() function what our data is. (The data = ...
   step.)

2. Tell ggplot() what relationships we want to see. For
   convenience we will put the results of the first two steps in an
   object called p. (The mapping = aes(...) step.)

3. Tell ggplot how we want to see the relationships in our data.
   (Choose a geom.)

4. Layer on geoms as needed, by adding them to the p object one
   at a time.

5. Use some additional functions to adjust scales, labels,
   tick-marks, titles. We’ll learn more about some of these
   functions shortly. (The scale_ family, labs() and guides()
   functions.)

To begin with we will let ggplot use its defaults for many of
these elements. The coordinate system used in plots is most often
cartesian, for example. It is a plane defined by an x-axis and a
y-axis. This is what ggplot assumes, unless you tell it otherwise.
But we will quickly start making some adjustments. Bear in mind
once again that the process of adding layers to the plot really is
additive. Usually in R, functions cannot simply be added to objects.
Rather, they take objects as inputs and produce objects as outputs.
But the objects created by ggplot() are special. (In effect we create
one big object that is a nested list of instructions for how to draw
each piece of the plot.) This makes it easier
to assemble plots one piece at a time, and to inspect how they look
at every step. For example, let’s try a different geom_ function with
our plot.


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_smooth()


You can see right away in figure 3.5 that some of these 
geoms do a lot more than simply put points on a grid. Here 
geom_smooth() has calculated a smoothed line for us and shaded 
in a ribbon showing the standard error for the line. If we want to 
see the data points and the line together (fig. 3.6), we simply add 
geom_point() back in: 


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'


Figure 3.5: Life expectancy vs GDP, using a
smoother.


The message R prints at the console tells you the geom_smooth()
function is using a method called gam, which in this case means it
has fit a generalized additive model. This suggests that maybe there are
other methods that geom_smooth() understands, and which we
might tell it to use instead. Instructions are given to functions via
their arguments, so we can try adding method = "lm" (for “linear
model”) as an argument to geom_smooth():


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "lm") 



Figure 3.6: Life expectancy vs GDP, showing both 
points and a GAM smoother. 


For figures 3.5 to 3.7 we did not have to tell geom_point() 
or geom_smooth() where their data was coming from, or what 
mappings they should use. They inherit this information from the 
original p object. As we’ll see later, it’s possible to give geoms separate
instructions that they will follow instead. But in the absence
of any other information, the geoms will look for the instructions
they need in the ggplot() function, or the object created
by it. 

In our plot, the data is quite bunched up against the left side. 
Gross domestic product per capita is not normally distributed 
across our country years. The x-axis scale would probably look 
better if it were transformed from a linear scale to a log scale. For 
this we can use a function called scale_x_log10(). As you might 
expect, this function scales the x-axis of a plot to a log 10 basis. To 
use it we just add it to the plot: 



Figure 3.7: Life expectancy vs GDP, points and an
ill-advised linear fit.

Figure 3.8: Life expectancy vs GDP scatterplot, with
a GAM smoother and a log scale on the x-axis.


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "gam") + scale_x_log10() 

The x-axis transformation repositions the points and also
changes the shape of the smoothed line (fig. 3.8). (We switched back
to gam from lm.) While ggplot() and its associated functions have
not made any changes to our underlying data frame, the scale
transformation is applied to the data before the smoother is layered
onto the plot. There are a variety of scale transformations that
you can use in just this way. Each is named for the transformation 
you want to apply, and the axis you want to apply it to. In this case 
we use scale_x_log10(). 
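
To check the claim that the transformation happens before the smoother is fitted, we can ask ggplot for the data a layer actually draws. The sketch below assumes the ggplot2 and gapminder packages are installed, uses layer_data() to pull out the smoother layer, and swaps in method = "lm" purely to keep the example quick. The names p_log and smooth_df are ours.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p_log <- p + geom_point() + geom_smooth(method = "lm") + scale_x_log10()

# layer_data() returns the data a layer actually draws; layer 2 is the smoother
smooth_df <- layer_data(p_log, 2)

range(smooth_df$x)          # log10 units: the fit was computed after transforming
range(gapminder$gdpPercap)  # dollars: the original data frame is untouched
```

The x values of the fitted line are on the log10 scale, while the gapminder table itself still holds GDP per capita in dollars.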

At this point, if our goal was just to show a plot of life expectancy
vs GDP using sensible scales and adding a smoother, we
would be thinking about polishing up the plot with nicer axis labels
and a title. Perhaps we might also want to replace the scientific
notation on the x-axis with the dollar value it actually represents. 
We can do both of these things quite easily. Let’s take care of the 
scale first. The labels on the tick-marks can be controlled through 
the scale_ functions. While it’s possible to roll your own function 
to label axes (or just supply your labels manually, as we will see 
later), there’s also a handy scales package that contains some useful
premade formatting functions. We can either load the whole
package with library(scales) or, more conveniently, just grab
the specific formatter we want from that library. Here it’s the
dollar() function. To grab a function directly from a package we
have not loaded, we use the syntax thepackage::thefunction. So
we can do this:



Figure 3.9: Life expectancy vs GDP scatterplot, with 
a GAM smoother and a log scale on the x-axis, with 
better labels on the tick-marks. 


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar)

We will learn more about scale transformations later. For now, 
just remember two things about them. First, you can directly transform
your x- or y-axis by adding something like scale_x_log10()
or scale_y_log10() to your plot. When you do so, the x- or 
y-axis will be transformed, and, by default, the tick-marks on 
the axis will be labeled using scientific notation. Second, you can 
give these scale_ functions a labels argument that reformats
the text printed underneath the tick-marks on the axes. Inside
the scale_x_log10() function try labels = scales::comma, for
example.
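
A formatter such as dollar or comma is just a function that takes numbers and returns text, so we can try one out on its own before handing it to a scale. A small sketch, assuming the scales package is installed; the vector tick_values is made up to resemble the breaks on our log axis.

```r
# Made-up values standing in for the tick positions on the x-axis
tick_values <- c(1000, 10000, 100000)

dollar_labels <- scales::dollar(tick_values)
comma_labels <- scales::comma(tick_values)

dollar_labels
comma_labels
```

When a scale_ function is given labels = scales::dollar, it applies the function to its own break values in the same way.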
3.5 Mapping Aesthetics vs Setting Them


An aesthetic mapping specifies that a variable will be expressed by 
one of the available visual elements, such as size, or color, or shape. 
As we’ve seen, we map variables to aesthetics like this: 


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
                                            color = continent))


This code does not give a direct instruction like “color the 
points purple.” Instead it says, “the property ‘color’ will represent 
the variable continent,” or “color will map continent.” If we want 
to turn all the points in the figure purple, we do not do it through 
the mapping function. Look at what happens when we try: 

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
                                            color = "purple"))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()



[Figure 3.10 plot: x-axis gdpPercap on a log scale; the legend is titled “colour” and has a single key, “purple.”]


Figure 3.10: What has gone wrong here? 


What has happened in figure 3.10? Why is there a legend 
saying “purple”? And why have the points all turned pinkish-red 
instead of purple? Remember, an aesthetic is a mapping of variables
in your data to properties you can see on the graph. The
aes() function is where that mapping is specified, and the function
is trying to do its job. It wants to map a variable to the color
aesthetic, so it assumes you are giving it a variable. We have only
given it one word, though: “purple.” Still, aes() will do its best to
treat that word as though it were a variable. A variable should have
as many observations as there are rows in the data, so aes() falls
back on R’s recycling rules for making vectors of different lengths
match up.

Just as in chapter 1, when we were able to write my_numbers + 1 to
add one to each element of the vector.

In effect, this creates a new categorical variable for your data.
The string “purple” is recycled for every row of your data. Now you
have a new column. Every element in it has the same value, “purple.”
Then ggplot plots the results on the graph as you’ve asked it
to, by mapping it to the color aesthetic. It dutifully makes a legend
for this new variable. By default, ggplot displays the points falling
into the category “purple” (which is all of them) using its default
first-category hue, which is red.
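
The recycling described above is ordinary R behavior, and it can be seen without ggplot at all. The sketch below uses only base R. The toy data frame is invented for illustration, and ggplot’s internals differ in detail, but the effect on a single string is the same.

```r
# A single number is recycled across every element of a vector
my_numbers <- c(1, 2, 3)
my_numbers + 1

# A single string is recycled to fill a whole column
toy <- data.frame(x = 1:3, y = c(10, 20, 30))
toy$color <- "purple"
toy
```

Every row of toy now has the value "purple" in its color column, which is just the kind of one-category variable that ends up with its own legend.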

The aes() function is for mappings only. Do not use it to
change properties to a particular value. If we want to set a property,
we do it in the geom_ we are using, and outside the mapping =
aes(...) step. Try this:



p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(color = "purple") + geom_smooth(method = "loess") +
    scale_x_log10()

The geom_point() function can take a color argument 
directly, and R knows what color “purple” is (fig. 3.11). This is not 
part of the aesthetic mapping that defines the basic structure of 
the graphic. From the point of view of the grammar or logic of the 
graph, the fact that the points are colored purple has no significance.
The color purple is not representing or mapping a variable
or feature of the data in the relevant way. 
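
The difference between mapping and setting also shows up in the data that ggplot prepares for drawing. As a rough check, assuming ggplot2 and gapminder are installed, we can build both versions and look at the colour column that layer_data() reports for the points. The object names here are ours.

```r
library(ggplot2)
library(gapminder)

# Mapped: "purple" is treated as a one-category variable and given a scale
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = "purple")) +
  geom_point()

# Set: "purple" is passed straight to the geom as a fixed property
p_set <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "purple")

mapped_colors <- unique(layer_data(p_mapped, 1)$colour)
set_colors <- unique(layer_data(p_set, 1)$colour)

mapped_colors  # a color chosen by the default discrete scale
set_colors     # the literal color we asked for
```

In the mapped version the color comes from the scale, not from the word we typed. In the set version the value is used as given.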


Figure 3.11: Setting the color attribute of the points 
directly. 



Figure 3.12: Setting some other arguments. 


It's also possible to map a continuous variable 
directly to the alpha property, much like one might 
map a continuous variable to a single-color gradient. 
However, this is generally not an effective way of 
precisely conveying variation in quantity. 


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) + geom_smooth(color = "orange", se = FALSE, 
size = 8, method = "lm") + scale_x_log10() 

The various geom_ functions can take many other arguments
that will affect how the plot looks but do not involve mapping variables
to aesthetic elements. Thus those arguments will never go
inside the aes() function. Some of the things we will want to set,
like color or size, have the same name as mappable elements. Others,
like the method or se arguments in geom_smooth(), affect other
aspects of the plot. In the code for figure 3.12, the geom_smooth()
call sets the line color to orange and sets its size (i.e., thickness) to
8, an unreasonably large value. We also turn off the se option by
switching it from its default value of TRUE to FALSE. The result is
that the standard error ribbon is not shown.

Meanwhile in the geom_point() call we set the alpha argument
to 0.3. Like color, size, and shape, “alpha” is an aesthetic
property that points (and some other plot elements) have, and to
which variables can be mapped. It controls how transparent the
object will appear when drawn. It’s measured on a scale of zero to
one. An object with an alpha of zero will be completely transparent.
Setting it to zero will make any other mappings the object might
have, such as color or size, invisible as well. An object with an
alpha of one will be completely opaque. Choosing an intermediate
value can be useful when there is a lot of overlapping data to
plot (as in fig. 3.13), as it makes it easier to see where the bulk of
the observations are located.


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) + 
geom_smooth(method = "gam") + 
scale_x_log10(labels = scales::dollar) + 
labs(x = "GDP Per Capita", y = "Life Expectancy in Years", 
title = "Economic Growth and Life Expectancy", 
subtitle = "Data points are country-years", 
caption = "Source: Gapminder.") 


We can now make a reasonably polished plot. We set the 
alpha of the points to a low value, make nicer x- and y-axis labels, 
and add a title, subtitle, and caption. As you can see in the code 
above, in addition to x, y, and any other aesthetic mappings in 
your plot (such as size, fill, or color), the labs() function can 
also set the text for title, subtitle, and caption. It controls 
the main labels of scales. The appearance of things like axis tick-marks
is the responsibility of various scale_ functions, such as the
scale_x_log10() function used here. We will learn more about 
what can be done with scale_ functions soon. 

Are there any variables in our data that can sensibly be mapped 
to the color aesthetic? Consider continent. In figure 3.14 the 
individual data points have been colored by continent, and a 
legend with a key to the colors has automatically been added to 
the plot. In addition, instead of one smoothing line we now have 
five. There is one for each unique value of the continent variable. 
This is a consequence of the way aesthetic mappings are inherited.
Along with x and y, the color aesthetic mapping is set in
the call to ggplot() that we used to create the p object. Unless told 
otherwise, all geoms layered on top of the original plot object will 
inherit that object’s mappings. In this case we get both our points 
and smoothers colored by continent. 





Figure 3.13: A more polished plot of Life Expectancy
vs GDP.

Figure 3.14: Mapping the continent variable to the
color aesthetic.


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
                                            color = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()

Figure 3.15: Mapping the continent variable to the
color aesthetic, and correcting the error bars using
the fill aesthetic.

Figure 3.16: Mapping aesthetics on a per geom
basis. Here color is mapped to continent for the 
points but not the smoother. 


If it is what we want, then we might also consider shading the 
standard error ribbon of each line to match its dominant color, 
as in figure 3.15. The color of the standard error ribbon is controlled
by the fill aesthetic. Whereas the color aesthetic affects
the appearance of lines and points, fill is for the filled areas of bars, 
polygons and, in this case, the interior of the smoother’s standard 
error ribbon. 

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
color = continent, fill = continent)) 
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10() 

Making sure that color and fill aesthetics match up consistently 
in this way improves the overall look of the plot. To make it happen 
we just need to specify that the mappings are to the same variable 
in each case. 


3.6 Aesthetics Can Be Mapped per Geom 

Perhaps five separate smoothers is too many, and we just want 
one line. But we still would like to have the points color-coded 
by continent. By default, geoms inherit their mappings from the
ggplot() function. We can change this by specifying different
aesthetics for each geom. We use the same mapping = aes(...)
expression as in the initial call to ggplot() but now use it in the
geom_ functions as well, specifying the mappings we want to apply 
to each one (fig. 3.16). Mappings specified in the initial ggplot() 
function—here, x and y—will carry through to all subsequent 
geoms. 


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) + 
geom_smooth(method = "loess") + 
scale_x_log10() 


It’s possible to map continuous variables to the color aesthetic,
too. For example, we can map the log of each country-year’s population
(pop) to color. (We can take the log of population right in
the aes() statement, using the log() function. R will evaluate this
for us.) When we do this, ggplot produces a gradient scale. It is
continuous but is marked at intervals in the legend. Depending on
the circumstances, mapping quantities like population to a continuous
color gradient (fig. 3.17) may be more or less effective than
cutting the variable into categorical bins running, e.g., from low to
high. In general it is always worth looking at the data in its continuous
form first rather than cutting or binning it into categories.


p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = log(pop))) + scale_x_log10() 


Finally, it is worth paying a little more attention to the way 
that ggplot draws its scales. Because every mapped variable has a 
scale, we can learn a lot about how a plot has been constructed, 
and what mappings it contains, by seeing what the legends look 
like. For example, take a closer look at the legends produced in 
figures 3.15 and 3.16. 

In the legend for the first figure, shown in figure 3.18 on the 
left, we see several visual elements. The key for each continent 
shows a dot, a line, and a shaded background. The key for the 
second figure, shown on the right, has only a dot for each continent,
with no shaded background or line. If you look again at the
code for figures 3.15 and 3.16, you will see that in the first case we
mapped the continent variable to both color and fill. We then
drew the figure with geom_point() and fitted a line for each continent
with geom_smooth(). Points have color but the smoother
understands both color (for the line itself) and fill (for the shaded 
standard error ribbon). Each of these elements is represented 
in the legend: the point color, the line color, and the ribbon fill. 
In the second figure, we decided to simplify things by having only 
the points be colored by continent. Then we drew just a single 
smoother for the whole graph. Thus, in the legend for that figure, 
the colored line and the shaded box are both absent. We see only 
a legend for the mapping of color to continent in geom_point(). 
Meanwhile on the graph itself the line drawn by geom_smooth() 
is set by default to a bright blue, different from anything on the 
scale, and its shaded error ribbon is set by default to gray. Small 
details like this are not accidents. They are a direct consequence 
of ggplot’s grammatical way of thinking about the relationship 
between the data behind the plot and the visual elements that 
represent it. 
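
The same contrast can be checked in code by counting how many separate smoothers each version of the plot draws. This is a sketch, not the code behind the figures: it assumes ggplot2 and gapminder are installed, and it uses method = "lm" in place of the loess smoother in the text only so that it runs quickly. The object names are ours.

```r
library(ggplot2)
library(gapminder)

# Figure 3.15 style: color and fill mapped in ggplot(), inherited by both geoms
p_all <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,
                               color = continent, fill = continent)) +
  geom_point() + geom_smooth(method = "lm") + scale_x_log10()

# Figure 3.16 style: color mapped only inside geom_point()
p_points <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm") + scale_x_log10()

# Layer 2 is the smoother; count the groups it was fitted to
n_smooth_all <- length(unique(layer_data(p_all, 2)$group))
n_smooth_points <- length(unique(layer_data(p_points, 2)$group))

c(inherited = n_smooth_all, per_geom = n_smooth_points)
```

When the smoother inherits the continent mapping it is fitted once per continent. When the mapping lives only in geom_point(), the smoother sees a single group.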
Figure 3.17: Mapping a continuous variable to color.

Figure 3.18: Guides and legends faithfully reflect the
mappings they represent.


3.7 Save Your Work

Now that you have started to make your own plots, you may be 
wondering how to save them, and perhaps also how to control their 
size and format. If you are working in an RMarkdown document, 
then the plots you make will be embedded in it, as we have already 
seen. You can set the default size of plots within your .Rmd document
by setting an option in your first code chunk. This one tells
R to make 8x5 figures: 


knitr::opts_chunk$set(fig.width = 8, fig.height = 5)


Because you will be making plots of different sizes and shapes, 
sometimes you will want to control the size of particular plots, 
without changing the default. To do this, you can add the same 
options to any particular chunk inside the curly braces at the 
beginning. Remember, each chunk opens with three backticks 
and then a pair of braces containing the language name (for us 
always r) and an optional label: 


{r example} 
p + geom_point() 


You can follow the label with a comma and provide a series 
of options if needed. They will apply only to that chunk. To 
make a figure twelve inches wide and nine inches high we 
say, e.g., {r example, fig.width = 12, fig.height = 9} in the
braces section. 

You will often need to save your figures individually, as they 
will end up being dropped into slides or published in papers that 
are not produced using RMarkdown. Saving a figure to a file can 
be done in several different ways. When working with ggplot, 
the easiest way is to use the ggsave() function. To save the most
recently displayed figure, we provide the name we want to save it 
under: 


ggsave(filename = "my_figure.png")

This will save the figure as a PNG file, a format suitable for
displaying on web pages. If you want a PDF instead, change the
extension of the file:

ggsave(filename = "my_figure.pdf")

Several other file formats are available as well. See
the function’s help page for details.

Remember that, for convenience, you do not need to write 
filename = as long as the name of the file is the first argument you 
give ggsave(). You can also pass plot objects to ggsave(). For 
example, we can put our recent plot into an object called p_out 
and then tell ggsave() that we want to save that object.

p_out <- p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
ggsave("my_figure.pdf", plot = p_out) 


When saving your work, it is useful to have one or more 
subfolders where you save only figures. You should also take 
care to name your saved figures in a sensible way. fig_1.pdf 
or my_figure.pdf are not good names. Figure names should be 
compact but descriptive, and consistent between figures within a 
project. In addition, although it really shouldn’t be the case in this 
day and age, it is also wise to play it safe and avoid file names containing
characters likely to make your code choke in future. These
include apostrophes, backticks, spaces, forward and back slashes, 
and quotes. 
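
If you find yourself naming a lot of figures, a small helper can enforce these habits for you. The function below is hypothetical, not part of ggplot or any package. It is a base R sketch that lowercases a description and replaces anything other than letters and digits with underscores.

```r
# Hypothetical helper: turn a free-form description into a safe file name
safe_fig_name <- function(description, ext = "pdf") {
  name <- tolower(description)
  name <- gsub("[^a-z0-9]+", "_", name)  # runs of other characters become "_"
  name <- gsub("^_+|_+$", "", name)      # trim underscores at either end
  paste0(name, ".", ext)
}

fig_name <- safe_fig_name("Life exp vs GDP")
fig_name
```

The result can be passed to ggsave() like any other file name.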

The appendix contains a short discussion of how to organize 
your files within your project folder. Treat the project folder as the 
home base of your work for the paper or work you are doing, and 
put your data and figures in subfolders within the project folder. To 
begin with, using your file manager, create a folder named “figures” 
inside your project folder. When saving figures, you can use Kirill 
Müller’s handy here library to make it easier to work with files
and subfolders while not having to type in full file paths. Load the 
library in the setup chunk of your RMarkdown document. When 
you do, it tells you where “here” is for the current project. You will 
see a message saying something like this, with your file path and 
user name instead of mine: 

# here() starts at /Users/kjhealy/projects/socviz

You can then use the here() function to make loading and saving
your work more straightforward and safer. Assuming a folder
named “figures” exists in your project folder, you can do this: 

ggsave(here("figures", "lifexp_vs_gdp_gradient.pdf"), plot = p_out)

This saves p_out as a file called lifexp_vs_gdp_gradient.pdf in
the figures directory here, i.e., in your current project folder. 

You can save your figure in a variety of formats, depending on 
your needs (and also, to a lesser extent, on your particular com¬ 
puter system). The most important distinction to bear in mind 
is between vector formats and raster formats. A file with a vector 
format, like PDF or SVG, is stored as a set of instructions about 
lines, shapes, colors, and their relationships. The viewing software 
(such as Adobe Acrobat or Apple’s Preview application for PDFs) 
then interprets those instructions and displays the figure. Representing
the figure this way allows it to be easily resized without
becoming distorted. The underlying language of the PDF format
is PostScript, which is also the language of modern typesetting
and printing. This makes a vector-based format like PDF the best 
choice for submission to journals. 

A raster-based format, on the other hand, stores images essentially
as a grid of pixels of a predefined size with information about
the location, color, brightness, and so on of each pixel in the
grid. This makes for more efficient storage, especially when used
in conjunction with compression methods that take advantage of
redundancy in images in order to save space. Formats like JPG are
compressed raster formats. A PNG file is a raster image format that
supports lossless compression. For graphs containing an awful lot
of data, PNG files will tend to be much smaller than the corresponding
PDF. However, raster formats cannot be easily resized.
In particular they cannot be expanded in size without becoming
pixelated or grainy. Formats like JPG and PNG are the standard
way that images are displayed on the web. The more recent SVG
format is a vector-based format that is nevertheless supported by
many web browsers.

In general you should save your work in several different formats.
When you save in different formats and in different sizes you
may need to experiment with the scaling of the plot and the size
of the fonts in order to get a good result. The scale argument to
ggsave() can help you here (you can try out different values, like
scale = 1.3, scale = 5, and so on). You can also use ggsave() to
explicitly set the height and width of your plot in the units that
you choose.

ggsave(here("figures", "lifexp_vs_gdp_gradient.pdf"), plot = p_out, 
height = 8, width = 10, units = "in") 
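
To try the whole saving workflow without touching your project folder, you can write to R’s temporary directory instead. The sketch below assumes ggplot2 and gapminder are installed. It rebuilds p_out with method = "lm" only to keep things quick, and the file name is ours.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p_out <- p + geom_point() + geom_smooth(method = "lm") + scale_x_log10()

# Write to a temporary folder so nothing in the project is overwritten
out_file <- file.path(tempdir(), "lifexp_vs_gdp.pdf")
ggsave(out_file, plot = p_out, height = 8, width = 10, units = "in")

file.exists(out_file)
```

Once the size and scaling look right, swap the temporary path for one built with here().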

Now that you know how to do that, let’s get back to making 
more graphs. 


3.8 Where to Go Next 

Start by playing around with the gapminder data a little more. You
can try each of these explorations with geom_point() and then
with geom_smooth(), or both together.

• What happens when you put the geom_smooth() function
before geom_point() instead of after it? What does this tell you
about how the plot is drawn? Think about how this might be
useful when drawing plots.

• Change the mappings in the aes() function so that you plot
life expectancy against population (pop) rather than per capita
GDP. What does that look like? What does it tell you about the
unit of observation in the dataset?

• Try some alternative scale mappings. Besides scale_x_log10(),
you can try scale_x_sqrt() and scale_x_reverse().
There are corresponding functions for y-axis transformations.
Just write y instead of x. Experiment with them to see what
effect they have on the plot, and whether they make any sense
to use.

• What happens if you map color to year instead of continent?
Is the result what you expected? Think about what class of
object year is. Remember you can get a quick look at the top of
the data, which includes some shorthand information on the
class of each variable, by typing gapminder.

• Instead of mapping color = year, what happens if you try
color = factor(year)?

• As you look at these different scatterplots, think about
figure 3.13 a little more critically. We worked it up to the point
where it was reasonably polished, but is it really the best way to
display this country-year data? What are we gaining and losing
by ignoring the temporal and country-level structure of the
data? How could we do better? Sketch out what an alternative
visualization might look like.


This license does not extend to, for example, 
overwriting or deleting your data by mistake. You
should still manage your project responsibly, at a 
minimum keeping good backups. But within R and 
at the level of experimenting with graphs at the 
console, you have a lot of freedom. 


As you begin to experiment, remember two things. First, it’s 
always worth trying something, even if you’re not sure what’s going 
to happen. Don’t be afraid of the console. The nice thing about 
making your graphics through code is that you won’t break anything
you can’t reproduce. If something doesn’t work, you can
figure out what happened, fix things, and rerun the code to make 
the graph again. 

Second, remember that the main flow of action in ggplot is 
always the same. You start with a table of data, you map the variables
you want to display to aesthetics like position, color, or shape,
and you choose one or more geoms to draw the graph. In your 
code this gets accomplished by making an object with the basic 
information about data and mappings, and then adding or layering 
additional information as needed. Once you get used to this way of 
thinking about your plots, especially the aesthetic mapping part, 
drawing them becomes easier. Instead of having to think about 
how to draw particular shapes or colors on the screen, the many 
geom_ functions take care of that for you. In the same way, learning 
new geoms is easier once you think of them as ways to display aesthetic
mappings that you specify. Most of the learning curve with
ggplot involves getting used to this way of thinking about your data 
and its representation in a plot. In the next chapter, we will flesh 
out these ideas a little more, cover some common ways plots go 
“wrong” (i.e., when they end up looking strange), and learn how 
to recognize and avoid those problems. 



4 Show the Right Numbers 


This chapter will continue to develop your fluency with ggplot’s 
central workflow while also expanding the range of things you 
can do with it. One of our goals is to learn how to make new 
kinds of graph. This means learning some new geoms, the functions
that make particular kinds of plots. But we will also get
a better sense of what ggplot is doing when it draws plots, and 
learn more about how to write code that prepares our data to be 
plotted. 

Code almost never works properly the first time you write 
it. This is the main reason that, when learning a new language, 
it is important to type out the exercises and follow along manually.
It gives you a much better sense of how the syntax of the
language works, where you’re likely to make errors, and what 
the computer does when that happens. Running into bugs and 
errors is frustrating, but it’s also an opportunity to learn a bit 
more. Errors can be obscure, but they are usually not malicious 
or random. If something has gone wrong, you can find out why it 
happened. 

In R and ggplot, errors in code can result in figures that don’t 
look right. We have already seen the result of one of the most 
common problems, when an aesthetic is mistakenly set to a constant
value instead of being mapped to a variable. In this chapter
we will discuss some useful features of ggplot that also commonly
cause trouble. They have to do with how to tell ggplot more about
the internal structure of your data (grouping), how to break up
your data into pieces for a plot (faceting), and how to get ggplot
to perform some calculations on or summarize your data before
producing the plot (transforming). Some of these tasks are part
of ggplot proper, so we will learn more about how geoms, with 
the help of their associated stat functions, can act on data before 
plotting it. As we shall also see, while it is possible to do a lot of 
transformation directly in ggplot, there can be more convenient 
ways to approach the same task.

We will see some alternatives to Cartesian
coordinates later.


4.1 Colorless Green Data Sleeps Furiously 

When you write ggplot code in R you are in effect trying to 
“say” something visually. It usually takes several iterations to say 
exactly what you mean. This is more than a metaphor here. The 
ggplot library is an implementation of the “grammar” of graphics,
an idea developed by Wilkinson (2005). The grammar is a
set of rules for producing graphics from data, taking pieces of 
data and mapping them to geometric objects (like points and 
lines) that have aesthetic attributes (like position, color, and size), 
together with further rules for transforming the data if needed 
(e.g., to a smoothed line), adjusting scales (e.g., to a log scale), and 
projecting the results onto a different coordinate system (usually 
Cartesian). 

A key point is that, like other rules of syntax, the grammar limits
the structure of what you can say, but it does not automatically
make what you say sensible or meaningful. It allows you to produce
long “sentences” that begin with mappings of data to visual
elements and add clauses about what sort of plot it is, how the axes 
are scaled, and so on. But these sentences can easily be garbled. 
Sometimes your code will not produce a plot at all because of some 
syntax error in R. You will forget a + sign between geom_ functions 
or lose a parenthesis somewhere so that your function statement 
becomes unbalanced. In those cases R will complain (perhaps in 
an opaque way) that something has gone wrong. At other times, 
your code will successfully produce a plot, but it will not look the 
way you expected it to. Sometimes the results will look very weird 
indeed. In those cases, the chances are you have given ggplot a 
series of grammatically correct instructions that are either nonsensical
in some way or have accidentally twisted what you meant
to say. These problems often arise when ggplot does not have quite
all the information it needs in order to make your graphic say what
you want it to say. 
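
The first kind of failure, where R cannot even read the code, can be demonstrated without drawing anything. The sketch below uses only base R. The helper check_syntax() is our own invention: it asks R to parse a line of code, without running it, and reports whether the parser accepted it.

```r
# Hypothetical helper: does R's parser accept this code?
check_syntax <- function(code) {
  tryCatch({
    parse(text = code)  # parse only; nothing is evaluated
    "parses"
  }, error = function(e) "syntax error")
}

results <- c(
  ok = check_syntax("p + geom_point() + geom_smooth()"),
  missing_plus = check_syntax("p + geom_point() geom_smooth()"),
  unbalanced = check_syntax("p + geom_point(")
)
results
```

Only the first line parses. The other two fail before ggplot is ever involved, which is why the resulting messages come from R itself and say nothing about your plot.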


4.2 Grouped Data and the "Group" Aesthetic 


Let’s begin again with our Gapminder dataset. Imagine we wanted
to plot the trajectory of GDP per capita over time for each country
in the data. We map year to x and gdpPercap to y. We take a
quick look at the documentation and discover that geom_line()
will draw lines by connecting observations in order of the variable
on the x-axis, which seems right. We write our code:

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line() 


Something has gone wrong in figure 4.1. What happened?
While ggplot will make a pretty good guess as to the structure of
the data, it does not know that the yearly observations in the data
are grouped by country. We have to tell it. Because we have not,
geom_line() gamely tries to join up all the observations for each
particular year in the order they appear in the dataset, as promised.
It starts with an observation for 1952 in the first row of the data.
It doesn’t know this belongs to Afghanistan. Instead of going to
Afghanistan 1957, it finds there are a series of 1952 observations,
so it joins all those up first, alphabetically by country, all the way
down to the 1952 observation that belongs to Zimbabwe. Then it
moves to the first observation in the next year, 1957.

The result is meaningless when plotted. Bizarre-looking output
in ggplot is common enough because everyone works out their
plots one bit at a time, and making mistakes is just a feature of
puzzling out how you want the plot to look. When ggplot successfully
makes a plot but the result looks insane, the reason is almost always
that something has gone wrong in the mapping between the data
and aesthetics for the geom being used. This is so common there’s
even a Twitter account devoted to the “Accidental aRt” that results.
So don’t despair!

In this case, we can use the group aesthetic to tell ggplot 
explicitly about this country-level structure. 



Figure 4.1: Trying to plot the data over time by
country.


This would have worked if there were only one
country in the dataset.


p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap)) 
p + geom_line(aes(group = country)) 


The plot in figure 4.2 is still fairly rough, but it is showing 
the data properly, with each line representing the trajectory of a 
country over time. The gigantic outlier is Kuwait, in case you are 
interested. 

Figure 4.2: Plotting the data over time by country,
again.

Figure 4.3: Faceting by continent.

The group aesthetic is usually only needed when the grouping
information you need to tell ggplot about is not built into the
variables being mapped. For example, when we were plotting the
points by continent, mapping color to continent was enough to
get the right answer because continent is already a categorical
variable, so the grouping is clear. When mapping x to year, however,
there is no information in the year variable itself to let ggplot
know that it is grouped by country for the purposes of drawing
lines with it. So we need to say that explicitly.
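
To see why the grouping matters, it can help to look at the order in
which observations get joined. The following is only a sketch in base
R, using a made-up data frame (toy, with hypothetical countries "A"
and "B") rather than the real gapminder data. It imitates the
sorting a line geom does, not ggplot's actual internals.

```r
# Toy stand-in for gapminder: two hypothetical countries, three years each.
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 2),
  gdpPercap = c(100, 110, 120, 900, 950, 1000),
  stringsAsFactors = FALSE
)

# Without a group, everything is treated as one series sorted by x,
# so observations from different countries end up joined to each other.
ungrouped <- toy[order(toy$year), ]
ungrouped$country
# "A" "B" "A" "B" "A" "B"

# With group = country, each country is its own series sorted by x.
grouped <- split(toy, toy$country)
lapply(grouped, function(d) d$year[order(d$year)])
```

The zigzag in figure 4.1 is the first ordering drawn as a single line;
the second is what aes(group = country) asks for.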


4.3 Facet to Make Small Multiples 

The plot we just made has a lot of lines on it. While the overall 
trend is more or less clear, it looks a little messy. One option is to 
facet the data by some third variable, making a “small multiple” 
plot. This is a powerful technique that allows a lot of information 
to be presented compactly and in a consistently comparable way. 
A separate panel is drawn for each value of the faceting variable. 
Facets are not a geom but rather a way of organizing a series of 
geoms. In this case we have the continent variable available to us. 
We will use facet_wrap() to split our plot by continent. 

The facet_wrap() function can take a series of arguments, 
but the most important is the first one, which is specified using 
R’s “formula” syntax, which uses the tilde character, ~. Facets are 
usually a one-sided formula. Most of the time you will just want 
a single variable on the right side of the formula. But faceting is 
powerful enough to accommodate what are in effect the graphical 
equivalent of multiway contingency tables, if your data is complex 
enough to require that. For our first example, we will just use a 
single term in our formula, which is the variable we want the data 
broken up by: facet_wrap(~ continent). 

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) + facet_wrap(~ continent)

Each facet is labeled at the top. The overall layout of figure 4.3
minimizes the duplication of axis labels and other scales. Remember,
too, that we can still include other geoms as before, and they
will be layered within each facet. We can also use the ncol argument
to facet_wrap() to control the number of columns used to
lay out the facets. Because we have only five continents, it might be
worth seeing if we can fit them on a single row (which means we’ll
have five columns). In addition, we can add a smoother, and a few
cosmetic enhancements that make the graph a little more effective
(fig. 4.4). In particular we will make the country trends a light gray
color. We need to write a little more code to make all this happen.
If you are unsure of what each piece of code does, take advantage
of ggplot’s additive character. Working backward from the bottom
up, remove each + some_function(...) statement one at a time
to see how the plot changes.

Figure 4.4: Faceting by continent, again.


p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color = "gray70", aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels = scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year",
         y = "GDP per capita",
         title = "GDP per capita on Five Continents")


This plot brings together an aesthetic mapping of x and y variables,
a grouping aesthetic (country), two geoms (a lineplot and a
smoother), a log-transformed y-axis with appropriate tick labels,
a faceting variable (continent), and finally axis labels and a title.

We could also have faceted by country, which would have made the
group mapping superfluous. But that would make almost 150 panels.

The facet_wrap() function is best used when you want a
series of small multiples based on a single categorical variable.
Your panels will be laid out in order and then wrapped into a
grid. If you wish you can specify the number of rows or the
number of columns in the resulting layout. Facets can be more
complex than this. For instance, you might want to cross-classify
some data by two categorical variables. In that case you should try
facet_grid() instead. This function will lay out your plot in a
true two-dimensional arrangement, instead of a series of panels
wrapped into a grid.

To begin with, we will use the GSS data in a slightly
naive way. In particular we will not consider sample
weights when making the figures in this chapter. In
chapter 6 we will learn how to calculate frequencies
and other statistics from data with a complex or
weighted survey design.

To see the difference, let’s introduce gss_sm, a new dataset 
that we will use in the next few sections, as well as later on in 
the book. It is a small subset of the questions from the 2016 General
Social Survey, or GSS. The GSS is a long-running survey
of American adults that asks about a range of topics of interest
to social scientists. The gapminder data consists mostly of continuous
variables measured within countries by year. Measures
like GDP per capita can take any value across a large range and 
they vary smoothly. The only categorical grouping variable is 
continent. It is an unordered categorical variable. Each country 
belongs to one continent, but the continents themselves have no 
natural ordering. By contrast, the GSS contains many categorical 
measures. 

In social scientific work, especially when analyzing individual-level
survey data, we often work with categorical data of various
kinds. Sometimes the categories are unordered, as with ethnicity
or sex. But they may also be ordered, as when we measure highest
level of education attained on a scale ranging from elementary
school to postgraduate degree. Opinion questions may be asked in
yes-or-no terms, or on a five- or seven-point scale with a neutral
value in the middle. Meanwhile, many numeric measures, such as
number of children, may still take only integer values within a
relatively narrow range. In practice these too may be treated as ordered
categorical variables running from zero to some top-coded value
such as “Six or more.” Even properly continuous measures, such as
income, are rarely reported to the dollar and are often obtainable
only as ordered categories. The GSS data in gss_sm contains many
measures of this sort. You can take a peek at it, as usual, by typing
its name at the console. You could also try glimpse(gss_sm),
which will give a compact summary of all the variables in
the data.

We will make a smoothed scatterplot (fig. 4.5) of the relationship
between the age of the respondent and the number of children
they have. In gss_sm the childs variable is a numeric count of the
respondent’s children. (There is also a variable named kids that
is the same measure, but its class is an ordered factor rather than
a number.) We will then facet this relationship by sex and race of
the respondent. We use R’s formula notation in the facet_grid()
function to facet by sex and race. This time, because we are
cross-classifying our results, the formula is two-sided:
facet_grid(sex ~ race).

Figure 4.5: Faceting on two categorical variables.
Each panel plots the relationship between age and
number of children, with the facets breaking out the
data by sex (in the rows) and race (in the columns).


p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
    geom_smooth() +
    facet_grid(sex ~ race)

Multipanel layouts of this kind are especially effective when
used to summarize continuous variation (as in a scatterplot) across
two or more categorical variables, with the categories (and hence
the panels) ordered in some sensible way. We are not limited to
two-way comparison. Further categorical variables can be added
to the formula, too (e.g., sex ~ race + degree), for more complex
multiway plots. However, the multiple dimensions of plots
like this will quickly become very complicated if the variables have
more than a few categories each.
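
One way to check how many panels a faceting formula will produce,
before drawing anything, is to inspect the layout that ggplot builds.
The sketch below assumes a made-up data frame (toy) standing in for
gss_sm, and it reads the layout table from ggplot_build(), which is
an internal structure and may differ between ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data: 2 levels of sex crossed with 3 levels of race.
toy <- expand.grid(sex = c("F", "M"), race = c("a", "b", "c"), rep = 1:5)
toy$age <- seq_len(nrow(toy))
toy$childs <- toy$age %% 4

p <- ggplot(toy, aes(x = age, y = childs)) + geom_point()

# The layout table has one row per panel.
wrapped <- ggplot_build(p + facet_wrap(~ race, ncol = 3))$layout$layout
gridded <- ggplot_build(p + facet_grid(sex ~ race))$layout$layout

nrow(wrapped)   # one panel per level of race
nrow(gridded)   # one panel per sex-by-race combination
```

Adding a further variable to the formula multiplies the number of
panels by its number of categories, which is why such layouts get
crowded so quickly.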
4.4 Geoms Can Transform Data


Try p + stat_smooth(), for example.



Figure 4.6: A bar chart. 


We have already seen several examples where geom_smooth() was 
included as a way to add a trend line to the figure. Sometimes 
we plotted a LOESS line, sometimes a straight line from an OLS 
regression, and sometimes the result of a Generalized Additive 
Model. We did not have to have any strong idea of the differences
between these methods. Neither did we have to write any
code to specify the underlying models, beyond telling the method 
argument in geom_smooth() which one we wanted to use. The 
geom_smooth() function did the rest. 

Thus some geoms plot our data directly on the figure, as is the 
case with geom_point(), which takes variables designated as x and 
y and plots the points on a grid. But other geoms clearly do more 
work on the data before it gets plotted. Every geom_ function has an 
associated stat_ function that it uses by default. The reverse is also 
the case: every stat_ function has an associated geom_ function 
that it will plot by default if you ask it to. This is not particularly 
important to know by itself, but as we will see in the next section, 
we sometimes want to calculate a different statistic for the geom 
from the default. 

Sometimes the calculations being done by the stat_ functions 
that work together with the geom_ functions might not be immediately
obvious. For example, consider figure 4.6, produced by a
new geom, geom_bar(). 

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()

Here we specified just one mapping, aes(x = bigregion).
The bar chart produced gives us a count of the number of (individual)
observations in the data set by region of the United States.
This seems sensible. But there is a y-axis variable here, count, that
is not in the data. It has been calculated for us. Behind the scenes,
geom_bar() called the default stat_ function associated with it,
stat_count(). This function computes two new variables, count
and prop (short for proportion). The count statistic is the one
geom_bar() uses by default.
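
If you want to see these computed variables for yourself, ggplot2's
layer_data() function returns the data frame a layer actually draws.
The sketch below assumes a small made-up data frame (toy) in place of
gss_sm, so the counts are illustrative only.

```r
library(ggplot2)

# Hypothetical stand-in for the bigregion column of gss_sm.
toy <- data.frame(bigregion = c("Northeast", "Midwest", "Midwest",
                                "South", "South", "South", "West"))
p <- ggplot(toy, aes(x = bigregion))

# The drawn data includes the variables computed by stat_count().
bars <- layer_data(p + geom_bar())
bars[, c("x", "count", "prop")]

# Calling the stat directly draws the same bars with its default geom.
same <- layer_data(p + stat_count())
all.equal(bars$count, same$count)
```

Neither count nor prop exists in toy; both appear only after the stat
has done its work.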


If we want a chart of relative frequencies rather than counts,
we will need to get the prop statistic instead. When ggplot calculates
the count or the proportion, it returns temporary variables
that we can use as mappings in our plots. The relevant statistic is
called ..prop.. rather than prop. To make sure these temporary
variables won’t be confused with others we are working with, their
names begin and end with two periods. (This is because we might
already have a variable called count or prop in our dataset.) So our
calls to it from the aes() function will generically look like this:
<mapping> = <..statistic..>. In this case, we want y to use the
calculated proportion, so we say aes(y = ..prop..).

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))

The resulting plot in figure 4.7 is still not right. We no longer
have a count on the y-axis, but the proportions of the bars all have
a value of 1, so all the bars are the same height. We want them
to sum to 1, so that we get the number of observations per region
as a proportion of the total number of observations, as in
figure 4.8. This is a grouping issue again. In a sense, it’s the reverse
of the earlier grouping problem we faced when we needed to tell
ggplot that our yearly data was grouped by country. In this case,
we need to tell ggplot to ignore the x-categories when calculating
the denominator of the proportion and use the total number of
observations instead. To do so we specify group = 1 inside the aes() call.
The value of 1 is just a kind of “dummy group” that tells ggplot to
use the whole dataset when establishing the denominator for its
prop calculations.

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
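
The effect of the dummy group on the denominator can be checked
numerically. This sketch again assumes a made-up data frame (toy)
rather than gss_sm, and it writes the statistic as after_stat(prop),
the spelling used by ggplot2 3.3.0 and later for what the text calls
..prop...

```r
library(ggplot2)

# Hypothetical stand-in for the bigregion column of gss_sm.
toy <- data.frame(bigregion = c("Northeast", "Midwest", "Midwest",
                                "South", "South", "South", "West"))
p <- ggplot(toy, aes(x = bigregion))

by_bar  <- layer_data(p + geom_bar(aes(y = after_stat(prop))))
overall <- layer_data(p + geom_bar(aes(y = after_stat(prop), group = 1)))

by_bar$prop        # each bar is its own group, so every proportion is 1
sum(overall$prop)  # one group for the whole dataset, so these sum to 1
```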

Let’s look at another question from the survey. The gss_sm 
data contains a religion variable derived from a question asking
“What is your religious preference? Is it Protestant, Catholic,
Jewish, some other religion, or no religion?” 

table(gss_sm$religion) 



##
## Protestant   Catholic     Jewish       None      Other
##       1371        649         51        619        159

Figure 4.7: A first go at a bar chart with proportions.

Figure 4.8: A bar chart with correct proportions.

To graph this, we want a bar chart with religion on the x-axis
(as a categorical variable), and with the bars in the chart also
colored by religion. If the gray bars look boring and we want to
fill them with color instead, we can map the religion variable to
fill in addition to mapping it to x. Remember, fill is for painting
the insides of shapes. If we map religion to color, only the border
lines of the bars will be assigned colors, and the insides will remain
gray.

Recall that the $ character is one way of accessing individual
columns within a data frame or tibble.

Figure 4.9: GSS religious preference mapped to
color (top) and both color and fill (bottom).


p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = FALSE)


By doing this, we have mapped two aesthetics to the same variable.
Both x and fill are mapped to religion. There is nothing
wrong with this. However, these are still two separate mappings,
and so they get two separate scales (fig. 4.9). The default is to show
a legend for the color variable. This legend is redundant because
the categories of religion are already separated out on the x-axis.
In its simplest use, the guides() function controls whether guiding
information about any particular mapping appears or not. If we
set guides(fill = FALSE), the legend is removed, in effect saying
that the viewer of the figure does not need to be shown any guiding
information about this mapping. Setting the guide for some
mapping to FALSE has an effect only if there is a legend to turn off
to begin with. Trying x = FALSE or y = FALSE will have no effect,
as these mappings have no additional guides or legends separate
from their scales. It is possible to turn the x and y scales off
altogether, but this is done through a different function, one from the
scale_ family.


4.5 Frequency Plots the Slightly Awkward Way 

A more appropriate use of the fill aesthetic with geom_bar() is to
cross-classify two categorical variables. This is the graphical
equivalent of a frequency table of counts or proportions. Using the GSS
data, for instance, we might want to examine the distribution of
religious preferences within different regions of the United States.
In the next few paragraphs we will see how to do this just using
ggplot. However, as we shall also discover, it is often not the most
transparent way to make frequency tables of this sort. The next
chapter introduces a simpler and less error-prone approach where
we calculate the table first before passing the results along to ggplot
to graph. As you work through this section, bear in mind that if
you find things slightly awkward or confusing it is because that’s
exactly what they are.

Let’s say we want to look at religious preference by census 
region. That is, we want the religion variable broken down proportionally
within bigregion. When we cross-classify categories
in bar charts, there are several ways to display the results. With 
geom_bar() the output is controlled by the position argument. 
Let’s begin by mapping fill to religion. 

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

The default output of geom_bar() is a stacked bar chart
(fig. 4.10) with counts on the y-axis (and hence counts within the
stacked segments of the bars also). Region of the country is on
the x-axis, and counts of religious preference are stacked within
the bars. As we saw in chapter 1, it is somewhat difficult for readers
of the chart to compare lengths and areas on an unaligned scale.
So while the relative positions of the bottom categories are quite
clear (thanks to them all being aligned on the x-axis), the relative
positions of, say, the “Catholic” category are harder to assess. An
alternative choice is to set the position argument to "fill". (This
is different from the fill aesthetic.)

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

Now (fig. 4.11) the bars are all the same height, which makes
it easier to compare proportions across groups. But we lose the
ability to see the relative size of each cut with respect to the overall
total. What if we wanted to show the proportion or percentage
of religions within regions of the country, like in figure 4.11, but
with separate bars instead of stacked ones? As
a first attempt, we can use position = "dodge" to make the bars
within each region of the country appear side by side. However, if
we do it this way (try it), we will find that ggplot places the bars
side-by-side as intended but changes the y-axis back to a count of
cases within each category rather than showing us a proportion.
We saw in figure 4.8 that to display a proportion we needed to map
y = ..prop.., so the correct statistic would be calculated. Let’s see
if that works.

Figure 4.10: A stacked bar chart of religious
preference by census region.

Figure 4.11: Using the fill position adjustment to
show relative proportions across categories.

Figure 4.12: A first go at a dodged bar chart with
proportional bars.

Figure 4.13: A second attempt at a dodged bar
chart with proportional bars.

Proportions for smaller subpopulations tend to
bounce around from year to year in the GSS.


p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..))


The result (fig. 4.12) is certainly colorful but not what we 
wanted. Just as in figure 4.7, there seems to be an issue with the 
grouping. When we just wanted the overall proportions for one 
variable, we mapped group = 1 to tell ggplot to calculate the proportions
with respect to the overall N. In this case our grouping
variable is religion, so we might try mapping that to the group 
aesthetic. 


p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..,
                                               group = religion))


This gives us a bar chart where the values of religion are broken
down across regions, with a proportion showing on the y-axis.
If you inspect the bars in figure 4.13, you will see that they do not 
sum to one within each region. Instead, the bars for any particular 
religion sum to one across regions. 

This lets us see that nearly half of those who said they were 
Protestant live in the South, for example. Meanwhile, just over 
10 percent of those saying they were Protestant live in the Northeast.
Similarly, it shows that over half of those saying they were
Jewish live in the Northeast, compared to about a quarter who live 
in the South. 
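
The difference between the two breakdowns comes down to which total
is used as the denominator. A small base R sketch makes this
concrete. The counts below are invented for illustration and are not
the GSS figures.

```r
# Hypothetical counts: rows are regions, columns are religions.
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(bigregion = c("Northeast", "South"),
                              religion  = c("Catholic", "Protestant")))

# Denominator = region total: proportions sum to one within each region,
# which is the breakdown we originally wanted.
within_region <- prop.table(tab, margin = 1)

# Denominator = religion total: proportions sum to one across regions,
# which is what mapping group = religion produced in figure 4.13.
within_religion <- prop.table(tab, margin = 2)

rowSums(within_region)
colSums(within_religion)
```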

We are still not quite where we originally wanted to be. Our 
goal was to take the stacked bar chart in Figure 4.10 but have the 
proportions shown side-by-side instead of on top of one another. 

It turns out that the easiest thing to do is to stop trying to force
geom_bar() to do all the work in a single step. Instead, we can ask
ggplot to give us a proportional bar chart of religious affiliation,
and then facet that by region. The proportions in figure 4.14 are
calculated within each panel, which is the breakdown we wanted.
This has the added advantage of not producing too many bars
within each category.

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = bigregion)) +
    facet_wrap(~ bigregion, ncol = 1)

Figure 4.14: Faceting proportions within region.

We could polish this plot further, but for the moment we will 
stop here. When constructing frequency plots directly in ggplot, 
it is easy to get stuck in a cycle of not quite getting the marginal 
comparison that you want, and more or less randomly poking at 
the mappings to try to stumble on the right breakdown. In the next 
chapter, we will learn how to use the tidyverse’s dplyr library to 
produce the tables we want before we try to plot them. This is a 
more reliable approach, and easier to check for errors. It will also 
give us tools that can be used for many more tasks than producing 
summaries. 


4.6 Histograms and Density Plots 

Different geoms transform data in different ways, but ggplot’s
vocabulary for them is consistent. We can see similar transformations
at work when summarizing a continuous variable using
a histogram, for example. A histogram is a way of summarizing a
continuous variable by chopping it up into segments or “bins” and
counting how many observations are found within each bin. In a
bar chart, the categories are given to us going in (e.g., regions of
the country, or religious affiliation). With a histogram, we have to
decide how finely to bin the data.

For example, ggplot comes with a dataset, midwest, containing 
information on counties in several midwestern states of the United 
States. Counties vary in size, so we can make a histogram showing
the distribution of their geographical areas. Area is measured
in square miles. Because we are summarizing a continuous variable
using a series of bars, we need to divide the observations into
groups, or bins, and count how many are in each one. By default, 
the geom_histogram() function will choose a bin size for us based 
on a rule of thumb. 


p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()


## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.


p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins = 10)


As with the bar charts, a newly calculated variable, count,
appears on the y-axis in figure 4.15. The notification from R tells
us that behind the scenes the stat_bin() function picked thirty
bins, but we might want to try something else. When drawing
histograms it is worth experimenting with bins and also optionally
the origin of the x-axis. Each, especially bins, will make a big
difference in how the resulting figure looks.
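
To see what stat_bin() computes for a given choice of bins, we can
again look at the layer's data rather than the picture. This is a
sketch using the midwest data that ships with ggplot2; the value of
forty bins is an arbitrary choice made here for comparison.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area))

# Each row of the layer data is one bar: its edges and its count.
ten   <- layer_data(p + geom_histogram(bins = 10))
forty <- layer_data(p + geom_histogram(bins = 40))

head(ten[, c("xmin", "xmax", "count")])
sum(ten$count) == nrow(midwest)   # every county lands in exactly one bin
nrow(forty) > nrow(ten)           # more bins means more, narrower bars
```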

While histograms summarize single variables, it’s also possible 
to use several at once to compare distributions. We can facet histograms
by some variable of interest, or as here, we can compare
them in the same plot using the fill mapping (fig. 4.16). 


Figure 4.15: Histograms of the same variable, using
different numbers of bins.

oh_wi <- c("OH", "WI")

p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)


We subset the data here to pick out just two states. To do this
we create a character vector with just two elements, “OH” and
“WI.” Then we use the subset() function to take our data and filter
it so that we select only rows whose state name is in this vector.
The %in% operator is a convenient way to filter on more than one
term in a variable when using subset().
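
The %in% operator returns one TRUE or FALSE per row, and subset()
keeps the rows marked TRUE. Here is a minimal sketch with a made-up
five-row data frame (toy), whose values are invented and only
loosely resemble the midwest data.

```r
# Hypothetical miniature of the midwest data.
toy <- data.frame(state      = c("IL", "OH", "WI", "MI", "OH"),
                  percollege = c(19.6, 17.3, 21.1, 18.0, 22.4),
                  stringsAsFactors = FALSE)
oh_wi <- c("OH", "WI")

toy$state %in% oh_wi
# FALSE  TRUE  TRUE FALSE  TRUE

picked <- subset(toy, subset = state %in% oh_wi)
nrow(picked)
# 3
```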

When working with a continuous variable, an alternative to
binning the data and making a histogram is to calculate a kernel
density estimate of the underlying distribution. The
geom_density() function will do this for us (fig. 4.17).

p <- ggplot(data = midwest, mapping = aes(x = area)) 
p + geom_density() 

We can use color (for the lines) and fill (for the body of the 
density curve) here, too. These figures often look quite nice. But 
when there are several filled areas on the plot, as in this case, the 
overlap can become hard to read. (Figures 4.18 and 4.19 are examples.)
If you want to make the baselines of the density curves go
away, you can use geom_line(stat = "density") instead. This 
also removes the possibility of using the fill aesthetic. But this may 
be an improvement in some cases. Try it with the plot of state areas 
and see how they compare. 


Figure 4.16: Comparing two histograms.


Figure 4.17: Kernel density estimate of county areas.


p <- ggplot(data = midwest, mapping = aes(x = area, fill = state,
                                          color = state))
p + geom_density(alpha = 0.3)

Just like geom_bar(), the count-based defaults computed
by the stat_ functions used by geom_histogram() and
geom_density() will return proportional measures if we ask them.
For geom_density(), the stat_density() function can return its
default ..density.. statistic, or ..scaled.., which will give a
proportional density estimate. It can also return a statistic called
..count.., which is the density times the number of points. This
can be used in stacked density plots.


Figure 4.18: Comparing distributions.

p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
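
The relationship between the three density statistics can be
sketched in base R. The example below uses simulated values (x) and
R's own density() function. It mirrors the definitions given above
and is not a reproduction of ggplot's internal code.

```r
set.seed(1)
x <- rnorm(200)            # hypothetical continuous variable

d <- density(x)            # kernel density estimate on a grid of points
dens   <- d$y              # plays the role of the density statistic
scaled <- d$y / max(d$y)   # rescaled so that the peak equals 1
count  <- d$y * length(x)  # density times the number of points

max(scaled)
# 1
```

Because scaled always peaks at 1, curves for groups of very different
sizes become easier to compare by shape, as in figure 4.19.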


4.7 Avoid Transformations When Necessary 

As we have seen from the beginning, ggplot normally makes its 
charts starting from a full dataset. When we call geom_bar() it 
does its calculations on the fly using stat_count() behind the 
scenes to produce the counts or proportions it displays. In the previous
section, we looked at a case where we wanted to group and
aggregate our data ourselves before handing it off to ggplot. But
often our data is, in effect, already a summary table. This can happen
when we have computed a table of marginal frequencies or
percentages from the original data. Plotting results from statistical
models also puts us in this position, as we will see later. Or it
may be that we just have a finished table of data (from the census, 
say, or an official report) that we want to make into a graph. For 
example, perhaps we do not have the individual-level data on who 
survived the Titanic disaster, but we do have a small table of counts 
of survivors by sex: 


Figure 4.19: Scaled densities.


titanic

##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6


Because we are working directly with percentage values in a
summary table, we no longer have any need for ggplot to count up
values for us or perform any other calculations. That is, we do not
need the services of any stat_ functions that geom_bar() would
normally call. We can tell geom_bar() not to do any work on the
variable before plotting it. To do this we say stat = "identity"
in the geom_bar() call. We’ll also move the legend to the top of the
chart (fig. 4.20).


Figure 4.20: Survival on the Titanic, by sex.

p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent,
                                          fill = sex))
p + geom_bar(position = "dodge", stat = "identity") +
    theme(legend.position = "top")


For convenience ggplot also provides a related geom,
geom_col(), which has exactly the same effect but assumes that
stat = "identity". We will use this form in the future when we
don’t need any calculations done on the plot.
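
As a quick check that the two forms agree, we can compare the data
each layer draws. The sketch below rebuilds the small titanic table
by hand from the values printed earlier, so that it runs without the
original data object.

```r
library(ggplot2)

# The summary table shown above, typed in by hand.
titanic <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(titanic, aes(x = fate, y = percent, fill = sex))

with_bar <- layer_data(p + geom_bar(position = "dodge", stat = "identity"))
with_col <- layer_data(p + geom_col(position = "dodge"))

# Both layers pass the table's percentages through untouched.
all.equal(with_bar$y, with_col$y)
sort(with_col$y)
```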

The position argument in geom_bar() and geom_col() can 
also take the value of "identity". Just as stat = "identity"
means “don’t do any summary calculations,” position = 
"identity" means “just plot the values as given.” This allows us 
to do things like plotting a flow of positive and negative values 
in a bar chart. This sort of graph is an alternative to a lineplot 
and is often seen in public policy settings where changes relative 
to some threshold level or baseline are of interest. For example, 
the oecd_sum table in socviz contains information on average 
life expectancy at birth within the United States and across other 
OECD countries. 

oecd_sum 


## # A tibble: 57 x 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70.0 0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.500 Below
## # ... with 47 more rows

The other column is the average life expectancy in a given year
for OECD countries, excluding the United States. The usa column
is the U.S. life expectancy, diff is the difference between the two
values, and hi_lo indicates whether the U.S. value for that year was
above or below the OECD average. We will plot the difference over
time and use the hi_lo variable to color the columns in the chart
(fig. 4.21).

Figure 4.21: Using geom_col() to plot negative and positive values
in a bar chart.

p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = FALSE) +
    labs(x = NULL, y = "Difference in Years",
         title = "The US Life Expectancy Gap",
         subtitle = "Difference between US and OECD
                     average life expectancies, 1960-2015",
         caption = "Data: OECD. After a chart by Christopher Ingraham,
                    Washington Post, December 27th 2017.")

As with the titanic plot, geom_col() sets stat to "identity", so
the values in the table are plotted as given. To get the same effect
with geom_bar() we would need to say geom_bar(stat =
"identity"). As before, the guides(fill = FALSE) instruction at
the end tells ggplot to drop the unnecessary legend that would
otherwise be automatically generated to accompany the fill mapping.
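
To see how bars behave on both sides of zero, here is a minimal
sketch that uses a small made-up table rather than oecd_sum. The
toy data frame, its values, and the way hi_lo is derived are
assumptions for illustration only. The layer_data() function from
ggplot2 lets us inspect what each layer will draw.

```r
library(ggplot2)

# Hypothetical data: yearly differences from a baseline, some negative
toy <- data.frame(
  year = 2001:2006,
  diff = c(1.2, 0.6, 0.1, -0.3, -0.9, -1.4)
)
# Flag which side of zero each value falls on (an assumed rule)
toy$hi_lo <- ifelse(toy$diff < 0, "Below", "Above")

p <- ggplot(data = toy,
            mapping = aes(x = year, y = diff, fill = hi_lo))

# geom_col() plots the values as given
p_col <- p + geom_col()

# The same chart written with geom_bar()
p_bar <- p + geom_bar(stat = "identity")

# Inspect the extent of each bar: ymin and ymax bracket zero and the value
bars_col <- layer_data(p_col)[, c("x", "ymin", "ymax")]
bars_bar <- layer_data(p_bar)[, c("x", "ymin", "ymax")]
bars_col
```

Each bar runs between zero and its value, so negative differences
hang below the axis and positive ones rise above it.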

At this point, we have a pretty good sense of the core steps we
must take to visualize our data. In fact, thanks to ggplot’s default
settings, we now have the ability to make good-looking and
informative plots. Starting with a tidy dataset, we know how to map
variables to aesthetics, choose from a variety of geoms, and make
some adjustments to the scales of the plot. We also know more
about selecting the right sort of computed statistic to show on
the graph, if that’s what’s needed, and how to facet our core plot by
one or more variables. We know how to set descriptive labels for
axes and write a title, subtitle, and caption. Now we’re in a position
to put these skills to work in a more fluent way.


4.8 Where to Go Next 

• Revisit the gapminder plots at the beginning of the chapter
and experiment with different ways to facet the data. Try plotting
population and per capita GDP while faceting on year, or
even on country. In the latter case you will get a lot of panels,
and plotting them straight to the screen may take a long time.
Instead, assign the plot to an object and save it as a PDF file to
your figures/ folder. Experiment with the height and width of
the figure.

• Investigate the difference between a formula written as
facet_grid(sex ~ race) and one written as facet_grid(~ sex +
race).

• Experiment to see what happens when you use facet_wrap()
with more complex formulas like facet_wrap(~ sex + race)
instead of facet_grid(). Like facet_grid(), the facet_wrap()
function can facet on two or more variables at once.
But it will do it by laying the results out in a wrapped
one-dimensional table instead of a fully cross-classified grid.

• Frequency polygons are closely related to histograms. Instead
of displaying the count of observations using bars, they display
it with a series of connected lines. You can try the various
geom_histogram() calls in this chapter using geom_freqpoly()
instead.

• A histogram bins observations for one variable and shows a
bar with the count in each bin. We can do this for two variables
at once, too. The geom_bin2d() function takes two mappings,
x and y. It divides your plot into a grid and colors the bins
by the count of observations in them. Try using it on the
gapminder data to plot life expectancy versus per capita GDP.
Like a histogram, you can vary the number or width of the bins
for both x and y. Instead of saying bins = 30 or binwidth = 1,
provide a number for both x and y with, for example, bins =
c(20, 50). If you specify binwidth instead, you will need to
pick values that are on the same scale as the variable you are
mapping.

• Density estimates can also be drawn in two dimensions. The
geom_density_2d() function draws contour lines estimating
the joint distribution of two variables. Try it with the
midwest data, for example, plotting percent below the poverty
line (percbelowpoverty) against percent college-educated
(percollege). Try it with and without a geom_point() layer.



5 Graph Tables, Add Labels, Make Notes


This chapter builds on the foundation we have laid down. Things
will get a little more sophisticated in three ways. First, we will
learn about how to transform data before we send it to ggplot to
be turned into a figure. As we saw in chapter 4, ggplot’s geoms will
often summarize data for us. While convenient, this can sometimes
be awkward or even a little opaque. Often it’s better to get
things into the right shape before we send anything to ggplot. This
is a job for another tidyverse component, the dplyr library. We
will learn how to use some of its “action verbs” to select, group,
summarize, and transform our data.

Second, we will expand the number of geoms we know about 
and learn more about how to choose between them. The more we 
learn about ggplot’s geoms, the easier it will be to pick the right 
one given the data we have and the visualization we want. As we 
learn about new geoms, we will also get more adventurous and 
depart from some of ggplot’s default arguments and settings. We 
will learn how to reorder the variables displayed in our figures, and 
how to subset the data we use before we display it. 

Third, this process of gradual customization will give us the
opportunity to learn more about the scale, guide, and theme
functions that we have mostly taken for granted until now. These will
give us even more control over the content and appearance of
our graphs. Together, these functions can be used to make plots
much more legible to readers. They allow us to present our data
in a more structured and easily comprehensible way, and to pick
out the elements of it that are of particular interest. We will begin
to use these methods to layer geoms on top of one another, a
technique that will allow us to produce sophisticated graphs in a
systematic, comprehensible way.

Our basic approach will not change. No matter how complex
our plots get, or how many individual steps we take to layer and
tweak their features, underneath we will always be doing the same
thing. We want a table of tidy data, a mapping of variables to
aesthetic elements, and a particular type of graph. If you can keep
sight of this, it will make it easier to confidently approach the job
of getting any particular graph to look just right.


5.1 Use Pipes to Summarize Data 

In chapter 4 we began making plots of the distributions and relative
frequencies of variables. Cross-classifying one measure by another
is one of the basic descriptive tasks in data analysis. Tables 5.1
and 5.2 show two common ways of summarizing our GSS data
on the distribution of religious affiliation and region. Table 5.1
shows the column marginals, where the numbers sum to a hundred
by column and show, e.g., the distribution of Protestants
across regions. Meanwhile in table 5.2 the numbers sum to a hundred
across the rows, showing, for example, the distribution of
religious affiliations within any particular region.

We saw in chapter 4 that geom_bar() can plot both counts and
relative frequencies depending on what we asked of it. In practice,
though, letting the geoms (and their stat_ functions) do the work
can sometimes get a little confusing. It is too easy to lose track
of whether one has calculated row margins, column margins, or
overall relative frequencies. The code to do the calculations on the
fly ends up stuffed into the mapping function and can become hard
to read. A better strategy is to calculate the frequency table you
want first and then plot that table. This has the benefit of allowing
you to do some quick sanity checks on your tables, to make sure
you haven’t made any errors.

Table 5.1
Column marginals. (Numbers in columns sum to 100.)

            Protestant  Catholic  Jewish  None  Other  NA
Northeast           12        25      53    18     18   6
Midwest             24        27       6    25     21  28
South               47        25      22    27     31  61
West                17        24      20    29     30   6

Table 5.2
Row marginals. (Numbers in rows sum to 100.)

            Protestant  Catholic  Jewish  None  Other  NA
Northeast           32        33       6    23      6   0
Midwest             47        25       0    23      5   1
South               62        15       1    16      5   1
West                38        25       2    28      8   0
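
The difference between row and column marginals, and the kind of
sanity check we have in mind, can be sketched with base R’s table()
and prop.table() functions. This is only an illustration on made-up
data, and it is not the dplyr approach developed in the rest of this
section. The toy data frame and its category labels are assumptions.

```r
# Hypothetical individual-level data: one row per respondent
toy <- data.frame(
  region   = c("North", "North", "North", "South", "South", "South", "South"),
  religion = c("A", "A", "B", "A", "B", "B", "B")
)

# Cross-classify region by religion to get a table of counts
counts <- table(toy$region, toy$religion)

# Row marginals: percentages sum to 100 within each region
row_pct <- round(100 * prop.table(counts, margin = 1), 0)

# Column marginals: percentages sum to 100 within each religion
col_pct <- round(100 * prop.table(counts, margin = 2), 0)

# Sanity checks: the sums should be 100, give or take rounding error
rowSums(row_pct)
colSums(col_pct)
```

Having the table in hand before plotting makes it obvious which
margin has been calculated, because we can simply add it up.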

Let’s say we want a plot of the row marginals for religion within
region. We will take the opportunity to do a little bit of data
munging in order to get from our underlying table of GSS data to the
summary tabulation that we want to plot. To do this we will use
the tools provided by dplyr, a component of the tidyverse that
provides functions for manipulating and reshaping tables of data on
the fly. We start from our individual-level gss_sm data frame with
its bigregion and religion variables. Our goal is a summary table
with percentages of religious preferences grouped within region.

As shown schematically in figure 5.1, we will start with our
individual-level table of about 2,500 GSS respondents. Then we
want to summarize them into a new table that shows a count of
each religious preference, grouped by region. Finally we will turn
these within-region counts into percentages, where the denominator
is the total number of respondents within each region. The
dplyr library provides a few tools to make this easy and clear to
read. We will use a special operator, %>%, to do our work. This is the
pipe operator. It plays the role of the yellow triangle in figure 5.1, in
that it helps us perform the actions that get us from one table to
the next.


[Figure 5.1 shows three tables in sequence: (1) individual-level GSS
data on region and religion, with columns id, bigregion, and
religion; (2) a summary count of religious preferences by census
region, with columns bigregion, religion, and N; (3) percent
religious preferences by census region, with columns bigregion,
religion, N, and pct.]

Figure 5.1: How we want to transform the individual-level data.


We have been building our plots in an additive fashion, starting
with a ggplot object and layering on new elements. By analogy,
think of the %>% operator as allowing us to start with a data
frame and perform a sequence or pipeline of operations to turn
it into another, usually smaller and more aggregated, table. Data
goes in one side of the pipe, actions are performed via functions,
and results come out the other. A pipeline is typically a series of
operations that do one or more of four things:

• Group the data into the nested structure we want for our
summary, such as “Religion by Region” or “Authors by Publications
by Year.”

• Filter or select pieces of the data by row, column, or both. This
gets us the piece of the table we want to work on.

• Mutate the data by creating new variables at the current level
of grouping. This adds new columns to the table without
aggregating it.

• Summarize or aggregate the grouped data. This creates new
variables at a higher level of grouping. For example we might
calculate means with mean() or counts with n(). This results in
a smaller, summary table, which we might further summarize
or mutate if we want.
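
As a rough sketch of these four kinds of action, here is each one
applied on its own to a small made-up table. The pubs table and its
columns are hypothetical, and the calls are written without the pipe
so that each verb can be seen separately.

```r
library(dplyr)

# Hypothetical data: a few publications by author and year
pubs <- tibble(
  author    = c("Ann", "Ann", "Ann", "Bob", "Bob", "Cy"),
  year      = c(2019, 2020, 2020, 2019, 2020, 2020),
  citations = c(10, 4, 6, 3, 7, 12)
)

# Group: set up the nested structure (author, then year within author)
grouped <- group_by(pubs, author, year)

# Filter rows and select columns: keep only the piece we want to work on
recent <- select(filter(pubs, year == 2020), author, citations)

# Mutate: add a column at the current level; the number of rows is unchanged
with_share <- mutate(pubs, share = citations / sum(citations))

# Summarize: aggregate the grouped rows into a smaller table
by_author_year <- summarize(grouped,
                            n_pubs = n(),
                            mean_cites = mean(citations))
```

Comparing nrow() across pubs, with_share, and by_author_year shows
the difference between mutating, which keeps every row, and
summarizing, which collapses rows to one per group.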

We use the dplyr functions group_by(), filter(), select(),
mutate(), and summarize() to carry out these tasks within our
pipeline. They are written in a way that allows them to be easily
piped. That is, they understand how to take inputs from the
left side of a pipe operator and pass results along through the
right side of one. The dplyr documentation has some useful
vignettes that introduce these grouping, filtering, selection, and
transformation functions. There is also a more detailed discussion
of these tools, along with many more examples, in Wickham &
Grolemund (2016).

We will create a new table called rel_by_region. Here’s the 
code: 


rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0))
What are these lines doing? First, we are creating an object as
usual, with the familiar assignment operator, <-. Next comes the
pipeline. Read the objects and functions from left to right, with
the pipe operator %>% connecting them together meaning “and
then...” Objects on the left side “pass through” the pipe, and
whatever is specified on the right of the pipe gets done to that object.
The resulting object then passes through to the right again, and so
on down to the end of the pipeline.
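
The “and then” reading can be checked on something much simpler
than a data frame. In this small sketch the vector x is made up,
and the same calculation is written once as nested function calls
and once as a pipeline.

```r
library(dplyr)   # makes the %>% operator available

# A small made-up vector of values
x <- c(4, 9, 16)

# Without the pipe, we read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the pipe, we read left to right:
# take x, and then sqrt(), and then mean(), and then round()
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both ways of writing it give the same answer
identical(nested, piped)
```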

Reading from the left, the code says this: 


• Create a new object, rel_by_region. It will get the result of the
following sequence of actions: Start with the gss_sm data, and
then

• Group the rows by bigregion and, within that, by religion.

• Summarize this table to create a new, much smaller table, with
three columns: bigregion, religion, and a new summary
variable, N, that is a count of the number of observations within
each religious group for each region.

• With this new table, use the N variable to calculate two new
columns: the relative proportion (freq) and percentage (pct)
for each religious category, still grouped by region. Round the
results to the nearest percentage point.




In this way of doing things, objects passed along the pipeline
and the functions acting on them carry some assumptions about
their context. For one thing, you don’t have to keep specifying the
name of the underlying data frame object you are working from.
Everything is implicitly carried forward from gss_sm. Within the
pipeline, the transient or implicit objects created from your
summaries and other transformations are carried through, too.

Second, the group_by() function sets up how the grouped or
nested data will be processed within the summarize() step. Any
function used to create a new variable within summarize(), such
as mean() or sd() or n(), will be applied to the innermost grouping
level first. Grouping levels are named from left to right within
group_by() from outermost to innermost. So the function call
summarize(N = n()) counts up the number of observations for
each value of religion within bigregion and puts them in a new
variable named N. As dplyr’s functions see things, summarizing
actions peel off one grouping level at a time, so that the resulting
summaries are at the next level up. In this case, we start with
individual-level observations and group them by religion within
region. The summarize() operation aggregates the individual
observations to counts of the number of people affiliated with each
religion, for each region.

Third, the mutate() step takes the N variable and uses it to create
freq, the relative frequency for each subgroup within region,
and finally pct, the relative frequency turned into a rounded
percentage. These mutate() operations add or remove columns from
tables but do not change the grouping level.

Inside both mutate() and summarize(), we are able to create
new variables in a way that we have not seen before. Usually, when
we see something like name = value inside a function, the name
is a general, named argument and the function is expecting
information from us about the specific value it should take (as in
the case of aes(x = gdpPercap, y = lifeExp), for example). Normally
if we give a function a named argument it doesn’t know about
(aes(chuckles = year)), it will ignore it, complain, or break.
With summarize() and mutate(), however, we can invent named
arguments. We are still assigning specific values to N, freq, and
pct, but we pick the names, too. They are the names that the newly
created variables in the summary table will have. The summarize()
and mutate() functions do not need to know what they will be in
advance.
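
The way summarizing peels off one grouping level can be seen
directly on a toy table. In the sketch below the data are made up,
although the column names mirror those in gss_sm. The group_vars()
function from dplyr reports which grouping variables a table
currently carries.

```r
library(dplyr)

# Hypothetical respondents; the values are invented for illustration
toy <- tibble(
  bigregion = c("East", "East", "East", "East", "West", "West"),
  religion  = c("A", "A", "B", "C", "A", "B")
)

counts <- toy %>%
  group_by(bigregion, religion) %>%
  summarize(N = n())        # N is a name we chose ourselves

# summarize() has peeled off the innermost level (religion),
# so the result is still grouped, but by bigregion only
group_vars(counts)

# Because of that, sum(N) here is a within-region total
shares <- counts %>%
  mutate(freq = N / sum(N))

# freq should therefore sum to 1 inside each region
totals <- shares %>% summarize(total = sum(freq))
totals
```

Depending on the version of dplyr, summarize() may print a message
noting that the output is still grouped by bigregion. That message
describes exactly the behavior discussed here.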

Finally, when we use mutate() to create the freq variable, not
only can we make up that name within the function, mutate() is
also clever enough to let us use that name right away, on the next
line of the same function call, when we create the pct variable. This
means we do not have to repeatedly write separate mutate() calls
for every new variable we want to create.
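
A short sketch on made-up counts shows that one mutate() call which
reuses a freshly created variable gives the same table as two
separate calls. The toy table here is hypothetical.

```r
library(dplyr)

# Hypothetical counts for three groups
toy <- tibble(group = c("a", "b", "c"), N = c(1, 1, 2))

# freq is created on one line and used on the next, in a single call
one_call <- toy %>%
  mutate(freq = N / sum(N),
         pct  = round(freq * 100, 0))

# The longer way: a separate mutate() call for each new variable
two_calls <- toy %>%
  mutate(freq = N / sum(N)) %>%
  mutate(pct = round(freq * 100, 0))

identical(one_call, two_calls)
```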

Our pipeline takes the gss_sm data frame, which has 2,867 
rows and 32 columns, and transforms it into rel_by_region, a 
summary table with 24 rows and 5 columns that looks like this, 
in part: 


rel_by_region

## # A tibble: 24 x 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324     32.
##  2 Northeast Catholic     162 0.332     33.
##  3 Northeast Jewish        27 0.0553     6
##  4 Northeast None         112 0.230     23
##  5 Northeast Other         28 0.0574     6
##  6 Northeast <NA>           1 0.00205    0
##  7 Midwest   Protestant   325 0.468     47
##  8 Midwest   Catholic     172 0.247     25
##  9 Midwest   Jewish         3 0.00432    0
## 10 Midwest   None         157 0.226     23
## # ... with 14 more rows


The variables specified in group_by() are retained in the
new summary table; the variables created with summarize() and
mutate() are added, and all the other variables in the original
dataset are dropped.

We said before that, when trying to grasp what each additive
step in a ggplot() sequence does, it can be helpful to work backward,
removing one piece at a time to see what the plot looks like
when that step is not included. In the same way, when looking at
pipelined code it can be helpful to start from the end of the line
and then remove one step at a time to see what the resulting
intermediate object looks like. For instance, what if we remove the
mutate() step from the code above? What does rel_by_region
look like then? What if we remove the summarize() step? How big
is the table returned at each step? What level of grouping is it at?
What variables have been added or removed?

Plots that do not require sequential aggregation and transformation
of the data before they are displayed are usually easy to
write directly in ggplot, as the details of the layout are handled by
a combination of mapping variables and layering geoms. One-step
filtering or aggregation of the data (such as calculating a proportion,
or a specific subset of observations) is also straightforward.
But when the result we want to display is several steps removed
from the data, and in particular when we want to group or aggregate
a table and do some more calculations on the result before
drawing anything, then it can make sense to use dplyr’s tools to
produce these summary tables first. This is true even if it would
be possible to do it within a ggplot() call. In addition to making
our code easier to read, pipes let us more easily perform sanity
checks on our results, so that we are sure we have grouped and
summarized things in the right order. For instance, if we have
done things properly with rel_by_region, the pct values associated
with religion should sum to 100 within each region, perhaps
with a bit of rounding error. We can quickly check this using a very
short pipeline:

rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))

## # A tibble: 4 x 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast  100.
## 2 Midwest    101.
## 3 South      100.
## 4 West       101.

This looks good. As before, now that we are working directly
with percentage values in a summary table, we can use geom_col()
instead of geom_bar().

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = "Region", y = "Percent", fill = "Religion") +
    theme(legend.position = "top")


We use a different position argument here, dodge2 instead
of dodge. This puts the bars side by side. When dealing with
precomputed values in geom_col(), the default position is to make a
proportionally stacked column chart. If you use dodge they will be
stacked within columns, but the result will read incorrectly. Using
dodge2 puts the subcategories (religious affiliations) side-by-side
within groups (regions). (Try going back to the code for figure 4.13,
in chapter 4, and using this dodge2 argument instead of the "dodge"
argument there.)

The values in the bar chart in figure 5.2 are the percentage
equivalents to the stacked counts in figure 4.10. Religious
affiliations sum to 100 percent within region. The trouble is, although
we now know how to cleanly produce frequency tables, this is still
a bad figure! It is too crowded, with too many bars side by side. We
can do better.

Figure 5.2: Religious preferences by region.

As a rule, dodged charts can be more cleanly expressed as
faceted plots. Faceting removes the need for a legend and thus
makes the chart simpler to read. We also introduce a new function.
If we map religion to the x-axis, the labels will overlap and become
illegible. It’s possible to manually adjust the tick-mark labels so that
they are printed at an angle, but that isn’t so easy to read, either. It
makes more sense to put the religions on the y-axis and the percent
scores on the x-axis. Because of the way geom_bar() works
internally, simply swapping the x and y mapping will not work. (Try
it and see what happens.) What we do instead is to transform the
coordinate system that the results are plotted in, so that the x and y
axes are flipped. We do this with coord_flip().


p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = FALSE) +
    coord_flip() +
    facet_grid(~ bigregion)


For most plots the coordinate system is Cartesian, showing
plots on a plane defined by an x-axis and a y-axis. The
coord_cartesian() function manages this, but we don’t need to
call it. The coord_flip() function switches the x and y axes after
the plot is made. It does not remap variables to aesthetics. In this
case, religion is still mapped to x and pct to y. Because the religion
names do not need an axis label to be understood, we set x =
NULL in the labs() call. (See fig. 5.3.)
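
One way to convince ourselves that coord_flip() leaves the mappings
alone is to look at the data ggplot computes for the layer. The
sketch below uses a made-up summary table. The layer_data() function
returns the values for each aesthetic before the coordinate system
is applied.

```r
library(ggplot2)

# Hypothetical summary table
toy <- data.frame(religion = c("A", "B", "C"), pct = c(50, 30, 20))

p <- ggplot(toy, aes(x = religion, y = pct)) +
  geom_col() +
  coord_flip()

# The layer's data are unchanged by the flip: x still holds the
# positions of the religion categories and y the percentages
ld <- layer_data(p)
ld[, c("x", "y")]

# Only the coordinate system differs from the default
inherits(p$coordinates, "CoordFlip")
```

This is why labs(x = NULL) is the right way to drop the label for
the religion names, even though they appear along the vertical axis
of the finished plot.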

Figure 5.3: Religious preferences by region, faceted version.

We will see more of what dplyr’s grouping and filtering operations
can do later. It is a flexible and powerful framework. For
now, think of it as a way to quickly summarize tables of data
without having to write code in the body of our ggplot() or geom_
functions.


5.2 Continuous Variables by Group or Category 

Let’s move to a new dataset, the organdata table. Like gapminder,
it has a country-year structure. It contains a little more than a
decade’s worth of information on the donation of organs for
transplants in seventeen OECD countries. The organ procurement
rate is a measure of the number of human organs obtained from
cadaver organ donors for use in transplant operations. Along with
this donation data, the dataset has a variety of numerical
demographic measures, and several categorical measures of health and
welfare policy and law. Unlike the gapminder data, some
observations are missing. These are designated with a value of NA, R’s
standard code for missing data. The organdata table is included
in the socviz library. Load it up and take a quick look. Instead of
using head(), for variety this time we will make a short pipeline to
select the first six columns of the dataset and then pick ten rows at
random using a function called sample_n(). This function takes
two main arguments. First we provide the table of data we want
to sample from. Because we are using a pipeline, this is implicitly
passed down from the beginning of the pipe. Then we supply the
number of draws we want to make. (Using numbers this way in
select() chooses the numbered columns of the data frame. You can
also select variable names directly.)
organdata %>% select(1:6) %>% sample_n(size = 10)

## # A tibble: 10 x 6
##    country        year       donors   pop pop_dens   gdp
##    <chr>          <date>      <dbl> <int>    <dbl> <int>
##  1 Switzerland    NA           NA      NA    NA       NA
##  2 Switzerland    1997-01-01   14.3  7089    17.2  27675
##  3 United Kingdom 1997-01-01   13.4 58283    24.0  22442
##  4 Sweden         NA           NA    8559     1.90 18660
##  5 Ireland        2002-01-01   21.0  3932     5.60 32571
##  6 Germany        1998-01-01   13.4 82047    23.0  23283
##  7 Italy          NA           NA   56719    18.8  17430
##  8 Italy          2001-01-01   17.1 57894    19.2  25359
##  9 France         1998-01-01   16.5 58398    10.6  24044
## 10 Spain          1995-01-01   27.0 39223     7.75 15720
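
Selecting columns by position and by name can be compared on a
small made-up table. The toy table below is hypothetical and only
loosely imitates the layout of organdata. Because sample_n() draws
rows at random, we set a seed so that the draw can be repeated.

```r
library(dplyr)

# Hypothetical country-year table
toy <- tibble(
  country = c("A", "A", "B", "B", "C", "C"),
  year    = c(2001, 2002, 2001, 2002, 2001, 2002),
  donors  = c(12.1, 12.5, NA, 20.3, 15.0, 15.4),
  pop     = c(100, 101, 50, 51, 75, 76)
)

# Selecting by position and by name gives the same columns
by_number <- toy %>% select(1:3)
by_name   <- toy %>% select(country, year, donors)
identical(by_number, by_name)

# Draw three rows at random; without set.seed() the rows picked
# would differ from run to run
set.seed(1)
draw <- toy %>% select(1:3) %>% sample_n(size = 3)
draw
```

Selecting by name is usually the safer habit, as it keeps working
if the order of the columns in the table changes.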


Let’s start by naively graphing some of the data. We can take
a look at a scatterplot of donors vs year (fig. 5.4).

p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_point()

## Warning: Removed 34 rows containing missing values
## (geom_point).

A message from ggplot warns you about the missing values.
We’ll suppress this warning from now on so that it doesn’t clutter
the output, but in general it’s wise to read and understand the
warnings that R gives, even when code appears to run properly. If
there are a large number of warnings, R will collect them all and
invite you to view them with the warnings() function.

We could use geom_line() to plot each country’s time series, 
like we did with the gapminder data. To do that, remember, we 
need to tell ggplot what the grouping variable is. This time we can 
also facet the figure by country (fig. 5.5), as we do not have too 
many of them. 

p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~country)

By default the facets are ordered alphabetically by country. We 
will see how to change this momentarily. 


Figure 5.4: Not very informative.

Figure 5.5: A faceted lineplot.


Let’s focus on the country-level variation but without paying
attention to the time trend. We can use geom_boxplot()
to get a picture of variation by year across countries. Just as
geom_bar() by default calculates a count of observations by the
category you map to x, the stat_boxplot() function that works
with geom_boxplot() will calculate a number of statistics that
allow the box and whiskers to be drawn. We tell geom_boxplot()
the variable we want to categorize by (here, country) and the
continuous variable we want summarized (here, donors).


p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot()
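
To see the statistics that are calculated for each box, we can ask
ggplot for the data behind the layer. The sketch below uses a small
made-up table rather than organdata, and the group labels and
values are assumptions.

```r
library(ggplot2)

# Hypothetical measurements for two groups
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

p <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_boxplot()

# One row per box: the whisker ends (ymin, ymax), the hinges
# (lower, upper), and the median (middle)
box_stats <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats
```

These are the numbers that geom_boxplot() turns into the box and
whiskers, in the same way that geom_bar() turns counts into bars.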

Figure 5.6: A first attempt at boxplots by country.

The boxplots in figure 5.6 look interesting, but two issues
could be addressed. First, as we saw in the previous chapter, it
is awkward to have the country names on the x-axis because the
labels will overlap. For figure 5.7 we use coord_flip() again to switch
the axes (but not the mappings).

p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))

p + geom_boxplot() + coord_flip()

Figure 5.7: Moving the countries to the y-axis.

That’s more legible but still not ideal. We generally want our
plots to present data in some meaningful order. An obvious way
is to have the countries listed from high to low average donation
rate. We accomplish this by reordering the country variable by
the mean of donors. The reorder() function will do this for us.
It takes two required arguments. The first is the categorical variable
or factor that we want to reorder. In this case, that’s country.
The second is the variable we want to reorder it by. Here that is
the donation rate, donors. The third and optional argument to
reorder() is the function you want to use as a summary statistic.
If you give reorder() only the first two required arguments, then
by default it will reorder the categories of your first variable by the
mean value of the second. You can use any sensible function you
like to reorder the categorical variable (e.g., median, or sd). There
is one additional wrinkle. In R, the default mean function will
return NA rather than a number if there are missing values in the
variable you are trying to take the average of. You must say that it
is OK to remove the missing values when calculating the mean. This
is done by supplying the na.rm=TRUE argument to reorder(), which
internally passes that argument on to mean(). We are reordering the
variable we are mapping to the x aesthetic, so we use reorder() at
that point in our code:



p <- ggplot(data = organdata, mapping = aes(x = reorder(country,
            donors, na.rm = TRUE), y = donors))
p + geom_boxplot() + labs(x = NULL) + coord_flip()
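
The behavior of reorder() with missing values can be tried out on
its own, away from any plot. This is a minimal sketch using two
made-up vectors that stand in for the country and donors columns.

```r
# Hypothetical donation rates, with one missing value
country <- c("A", "A", "B", "B", "C", "C")
donors  <- c(10, NA, 30, 32, 20, 22)

# With a missing value present, mean() gives NA unless told to drop it
mean(donors)
mean(donors, na.rm = TRUE)

# Reorder countries by mean donation rate; na.rm = TRUE is passed
# along to mean()
by_mean <- reorder(country, donors, na.rm = TRUE)
levels(by_mean)

# Another summary function can be supplied through FUN, here the median
by_median <- reorder(country, donors, FUN = median, na.rm = TRUE)
levels(by_median)
```

The order of the levels is what ggplot uses to arrange the
categories along the axis, so changing the levels changes the order
of the boxes.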

Because it’s obvious what the country names are, in the labs()
call we set their axis label to empty with labs(x = NULL). Ggplot
offers some variants on the basic boxplot, including the violin
plot. Try redoing figure 5.8 with geom_violin(). There are also
numerous arguments that control the finer details of the boxes and
whiskers, including their width. Boxplots can also take color and
fill aesthetic mappings like other geoms, as in figure 5.9.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, fill = world))
p + geom_boxplot() + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

Figure 5.8: Boxplots reordered by mean donation rate.

Figure 5.9: A boxplot with the fill aesthetic mapped.


Putting categorical variables on the y-axis to compare their
distributions is a useful trick. It makes it easy to effectively present
summary data on more categories. The plots can be quite compact
and fit a relatively large number of cases in by row. The approach
also has the advantage of putting the variable being compared onto
the x-axis, which sometimes makes it easier to compare across
categories. If the number of observations within each category
is relatively small, we can skip (or supplement) the boxplots and
show the individual observations, too. In figure 5.10 we map the
world variable to color instead of fill as the default geom_point()
plot shape has a color attribute but not a fill.
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Figure 5.10: Using points instead of a boxplot. 
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Figure 5.11: Jittering the points. 
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p ggplot(data = organdata, 

mapping = aes(x = reorder(country, donors, na.rm=TRUE), 
y = donors, color = world)) 
p + geom_point() + labs(x=NULL) + 

coord_f)ip() + theme (legend, position = "top") 

When we use geom_point() like this, there is some overplotting
of observations. In these cases, it can be useful to perturb
the data a bit in order to get a better sense of how many observations
there are at different values. We use geom_jitter() to
do this (fig. 5.11). This geom works much like geom_point() but
randomly nudges each observation by a small amount.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm = TRUE),
                          y = donors, color = world))
p + geom_jitter() + labs(x = NULL) +
    coord_flip() + theme(legend.position = "top")

Figure 5.12: A jittered plot.


The default amount of jitter is a little too much for our purposes.
We can control it using height and width arguments to
a position_jitter() function within the geom. Because we’re
making a one-dimensional summary here, we just need width.
Figure 5.12 shows the data with a more appropriate amount of
jitter. Can you see why we did not use height? If not, try it
and see what happens.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm = TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width = 0.15)) +
    labs(x = NULL) + coord_flip() + theme(legend.position = "top")
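The effect of jittering in one direction only can be checked on a small made-up data frame. This is a sketch under stated assumptions: the toy data are hypothetical, and the explicit height = 0 and the seed argument are additions for illustration that do not appear in the code above.

```r
library(ggplot2)

# Toy data (hypothetical): three categories, many repeated values
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  value = rep(c(10, 12, 14, 16), times = 15)
)

# Jitter only along the categorical axis. Setting height = 0 leaves the
# measured values exactly where they are; seed makes the jitter repeatable.
p <- ggplot(toy, aes(x = group, y = value)) +
  geom_jitter(position = position_jitter(width = 0.15, height = 0, seed = 42)) +
  coord_flip()

# The built layer data show that only x was perturbed
built <- ggplot_build(p)$data[[1]]
range(built$y)
```

Jittering along the measured axis would move points to values that were never observed, which is why only the categorical direction is perturbed in a one-dimensional summary like this.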

When we want to summarize a categorical variable that just
has one point per category, we should use this approach as well.
The result will be a Cleveland dotplot, a simple and extremely
effective method of presenting data that is usually better than
either a bar chart or a table. For example, we can make a Cleveland
dotplot of the average donation rate.

This also gives us another opportunity to do a little bit of
data munging with a dplyr pipeline. We will use one to aggregate
our larger country-year data frame to a smaller table of summary
statistics by country. There is more than one way to use a pipeline
for this task. We could choose the variables we want to summarize
and then repeatedly use the mean() and sd() functions to calculate
the means and standard deviations of the variables we want.


by_country <- organdata %>% group_by(consent_law, country) %>%
    summarize(donors_mean = mean(donors, na.rm = TRUE),
              donors_sd = sd(donors, na.rm = TRUE),
              gdp_mean = mean(gdp, na.rm = TRUE),
              health_mean = mean(health, na.rm = TRUE),
              roads_mean = mean(roads, na.rm = TRUE),
              cerebvas_mean = mean(cerebvas, na.rm = TRUE))


The pipeline consists of two steps. First we group the data
by consent_law and country, and then we use summarize() to
create six new variables, each one of which is the mean or standard
deviation of each country’s score on a corresponding variable
in the original organdata data frame. As usual, the summarize()
step will inherit information about the original data and the
grouping and then do its calculations at the innermost grouping
level. In this case it takes all the observations for each country
and calculates the mean or standard deviation as requested.

For an alternative view, change country to year in the grouping
statement and see what happens.

Here is what the resulting object looks like:


by_country

## # A tibble: 17 x 8
## # Groups:   consent_law [?]
##    consent_law country        donors_mean donors_sd gdp_mean health_mean roads_mean cerebvas_mean
##    <chr>       <chr>                <dbl>     <dbl>    <dbl>       <dbl>      <dbl>         <dbl>
##  1 Informed    Australia             10.6      1.14   22179.       1958.       105.          558.
##  2 Informed    Canada                14.0     0.751   23711.       2272.       109.          422.
##  3 Informed    Denmark               13.1      1.47   23722.       2054.       102.          641.
##  4 Informed    Germany               13.0     0.611   22163.       2349.       113.          707.
##  5 Informed    Ireland               19.8      2.48   20824.       1480.       118.          705.
##  6 Informed    Netherlands           13.7      1.55   23013.       1993.       76.1          585.
##  7 Informed    United Kingdom        13.5     0.775   21359.       1561.       67.9          708.
##  8 Informed    United States         20.0      1.33   29212.       3988.       155.          444.
##  9 Presumed    Austria               23.5      2.42   23876.       1875.       150.          769.
## 10 Presumed    Belgium               21.9      1.94   22500.       1958.       155.          594.
## 11 Presumed    Finland               18.4      1.53   21019.       1615.       93.6          771.
## 12 Presumed    France                16.8      1.60   22603.       2160.       156.          433.
## 13 Presumed    Italy                 11.1      4.28   21554.       1757.       122.          712.
## 14 Presumed    Norway                15.4      1.11   26448.       2217.       70.0          662.
## 15 Presumed    Spain                 28.1      4.96   16933.       1289.       161.          655.
## 16 Presumed    Sweden                13.1      1.75   22415.       1951.       72.3          595.
## 17 Presumed    Switzerland           14.2      1.71   27233.       2776.       96.4          424.


As before, the variables specified in group_by() are retained 
in the new data frame, the variables created with summarize() 
are added, and all the other variables in the original data are 
dropped. The countries are also summarized alphabetically within 
consent_law, which was the outermost grouping variable in the 
group_by() statement at the start of the pipeline. 
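The behavior just described can be checked on a small made-up tibble. This is a minimal sketch, assuming dplyr is installed; the toy object and its values are hypothetical and stand in for organdata.

```r
library(dplyr)

# Toy data (hypothetical): two countries, two years each
toy <- tibble(
  consent_law = c("Informed", "Informed", "Presumed", "Presumed"),
  country     = c("A", "A", "B", "B"),
  world       = c("Liberal", "Liberal", "SocDem", "SocDem"),
  donors      = c(10, 14, 20, NA)
)

by_toy <- toy %>%
  group_by(consent_law, country) %>%
  summarize(donors_mean = mean(donors, na.rm = TRUE))

names(by_toy)       # grouping columns plus the new summary; world is gone
group_vars(by_toy)  # the innermost grouping level has been peeled off
```

The ungrouped world column does not survive the summary, and the result is still grouped by consent_law, the outermost grouping variable.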

Using our pipeline this way is reasonable, but the code is worth
looking at again. For one thing, we have to repeatedly type out the
names of the mean() and sd() functions and give each of them the
name of the variable we want summarized and the na.rm = TRUE
argument each time to make sure the functions don’t complain
about missing values. We also repeatedly name our new summary
variables in the same way, by adding _mean or _sd to the end of
the original variable name. If we wanted to calculate the mean and
standard deviation for all the numerical variables in organdata,
our code would get even longer. Plus, in this version we lose the
other, time-invariant categorical variables that we haven’t grouped
by, such as world. When we see repeated actions like this in our
code, we can ask whether there’s a better way to proceed.

There is. What we would like to do is apply the mean() and
sd() functions to every numerical variable in organdata, but
only the numerical ones. Then we want to name the results in a
consistent way and return a summary table including all the categorical
variables like world. We can create a better version of the
by_country object using a little bit of R’s functional programming
abilities. Here is the code:

by_country <- organdata %>% group_by(consent_law, country) %>%
    summarize_if(is.numeric, funs(mean, sd), na.rm = TRUE) %>%
    ungroup()

The pipeline starts off just as before, taking organdata and
then grouping it by consent_law and country. In the next
step, though, instead of manually taking the mean and standard
deviation of a subset of variables, we use the summarize_if()
function. As its name suggests, it examines each column
in our data and applies a test to it. It only summarizes if the test
is passed, that is, if it returns a value of TRUE. Here the test is the
function is.numeric(), which looks to see if a vector is a numeric
value. If it is, then summarize_if() will apply the summary function
or functions we want to that column of organdata. Because we are
taking both the mean and the standard deviation, we use funs() to list
the functions we want. And we finish with the na.rm = TRUE argument,
which will be passed on to each use of both mean() and sd().
In the last step in the pipeline we ungroup() the data, so that the
result is a plain tibble.

We do not have to use parentheses when naming the functions inside
summarize_if().

Sometimes graphing functions can get confused by grouped tibbles where
we don't explicitly use the groups in the plot.

Here is what the pipeline returns: 


by_country

## # A tibble: 17 x 28
##    consent_law country        donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean health_mean
##    <chr>       <chr>                <dbl>    <dbl>         <dbl>    <dbl>        <dbl>       <dbl>
##  1 Informed    Australia             10.6   18318.         0.237   22179.       21779.       1958.
##  2 Informed    Canada                14.0   29608.         0.297   23711.       23353.       2272.
##  3 Informed    Denmark               13.1    5257.          12.2   23722.       23275.       2054.
##  4 Informed    Germany               13.0   80255.          22.5   22163.       21938.       2349.
##  5 Informed    Ireland               19.8    3674.          5.23   20824.       20154.       1480.
##  6 Informed    Netherlands           13.7   15548.          37.4   23013.       22554.       1993.
##  7 Informed    United Kingdom        13.5   58187.          24.0   21359.       20962.       1561.
##  8 Informed    United States         20.0  269330.          2.80   29212.       28699.       3988.
##  9 Presumed    Austria               23.5    7927.          9.45   23876.       23415.       1875.
## 10 Presumed    Belgium               21.9   10153.          30.7   22500.       22096.       1958.
## 11 Presumed    Finland               18.4    5112.          1.51   21019.       20763.       1615.
## 12 Presumed    France                16.8   58056.          10.5   22603.       22211.       2160.
## 13 Presumed    Italy                 11.1   57360.          19.0   21554.       21195.       1757.
## 14 Presumed    Norway                15.4    4386.          1.35   26448.       25769.       2217.
## 15 Presumed    Spain                 28.1   39666.          7.84   16933.       16584.       1289.
## 16 Presumed    Sweden                13.1    8789.          1.95   22415.       22094.       1951.
## 17 Presumed    Switzerland           14.2    7037.          17.0   27233.       26931.       2776.
## # ... with 20 more variables: health_lag_mean <dbl>, pubhealth_mean <dbl>, roads_mean <dbl>,
## #   cerebvas_mean <dbl>, assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## #   donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>, gdp_lag_sd <dbl>,
## #   health_sd <dbl>, health_lag_sd <dbl>, pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## #   assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>

Figure 5.13: A Cleveland dotplot, with colored points.


All the numeric variables have been summarized. They are
named using the original variable, with the function’s name
appended: donors_mean and donors_sd, and so on. This is a compact
way to rapidly transform our data in various ways. There is
a family of summarize_ functions for various tasks, and a complementary
group of mutate_ functions for when we want to add
columns to the data rather than aggregate it.
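Readers on newer versions of dplyr may see a deprecation warning for funs(). From dplyr 1.0.0 onward the same kind of summary can be written with across(). The following is a minimal sketch of that alternative, not the code used for the figures in this chapter; the toy tibble and its values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): one character column to group by, two numeric columns
toy <- tibble(
  consent_law = c("Informed", "Informed", "Presumed", "Presumed"),
  country     = c("A", "A", "B", "B"),
  donors      = c(10, 14, 20, NA),
  gdp         = c(100, 120, 90, 110)
)

by_toy <- toy %>%
  group_by(consent_law, country) %>%
  summarize(
    # Apply both functions to every numeric column; the names of the list
    # elements become the suffixes of the new column names
    across(where(is.numeric),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE))),
    .groups = "drop"   # return a plain, ungrouped tibble
  )

names(by_toy)
```

The result follows the same naming pattern as before, with _mean and _sd appended to each original variable name, and .groups = "drop" plays the role of the final ungroup() step.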

With our data summarized by country, for figure 5.13 we can
draw a dotplot with geom_point(). Let’s also color the results by
the consent law for each country.

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean, y = reorder(country, donors_mean),
                          color = consent_law))
p + geom_point(size = 3) +
    labs(x = "Donor Procurement Rate",
         y = "", color = "Consent Law") +
    theme(legend.position = "top")

Alternatively, if we liked, we could use a facet instead of coloring
the points. Using facet_wrap(), we can split the consent_law
variable into two panels and then rank the countries by donation
rate within each panel. Because we have a categorical variable on
our y-axis, there are two wrinkles worth noting. First, if we leave
facet_wrap() to its defaults, the panels will be plotted side by side.
This will make it difficult to compare the two groups on the same
scale. Instead the plot will be read left to right, which is not useful.
To avoid this, we will have the panels appear one on top of
the other by saying we want to have only one column. This is the
ncol = 1 argument. Second, and again because we have a categorical
variable on the y-axis, the default facet plot will have the names of
every country appear on the y-axis of both panels. (Were the y-axis
a continuous variable, this would be what we would want.) With a
categorical y-axis, it means only half the rows in each panel of our
plot will have points in them.

To avoid this we allow the y-axis scale to be free. This is
the scales = "free_y" argument. Again, for faceted plots where
both variables are continuous, we generally do not want the scales
to be free, because it allows the x- or y-axis for each panel to vary
with the range of the data inside that panel only, instead of the
range across the whole dataset. Ordinarily, the point of small-multiple
facets is to be able to compare across the panels. This
means free scales are usually not a good idea, because each panel
gets its own x- or y-axis range, which breaks comparability. But
where one axis is categorical, as in figure 5.14, we can free the
categorical axis and leave the continuous one fixed. The result is
that each panel shares the same x-axis, and it is easy to compare
them.


p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
    facet_wrap(~ consent_law, scales = "free_y", ncol = 1) +
    labs(x = "Donor Procurement Rate",
         y = "")


Figure 5.14: A faceted dotplot with free scales on 
the y-axis. 


Cleveland dotplots are generally preferred to bar or column
charts. When making them, put the categories on the y-axis and
order them in the way that is most relevant to the numerical
summary you are providing. This sort of plot is also an excellent
way to summarize model results or any data with error ranges.
We use geom_point() to draw our dotplots. There is a geom called
geom_dotplot(), but it is designed to produce a different sort of
figure. It is a kind of histogram, with individual observations represented
by dots that are then stacked on top of one another to
show how many of them there are.

The Cleveland-style dotplot can be extended to cases where we
want to include some information about variance or error in the
plot. Using geom_pointrange(), we can tell ggplot to show us a
point estimate and a range around it. Here we will use the standard
deviation of the donation rate that we calculated above. But this is
also the natural way to present, for example, estimates of model
coefficients with confidence intervals. With geom_pointrange()
we map our x and y variables as usual, but the function needs a little
more information than geom_point(). It needs to know the range
of the line to draw on either side of the point, defined by the arguments
ymax and ymin. This is given by the y value (donors_mean)
plus or minus its standard deviation (donors_sd). If a function
argument expects a number, it is OK to give it a mathematical
expression that resolves to the number you want. R will calculate
the result for you.


p <- ggplot(data = by_country, mapping = aes(x = reorder(country,
    donors_mean), y = donors_mean))
p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
                                  ymax = donors_mean + donors_sd)) +
    labs(x = "", y = "Donor Procurement Rate") + coord_flip()

Because geom_pointrange() expects y, ymin, and ymax as
arguments, in figure 5.15 we map donors_mean to y and the country
variable to x, then flip the axes at the end with coord_flip().
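To confirm that the expressions given to ymin and ymax are evaluated row by row, we can build a small plot and inspect the values ggplot computed. This is a sketch on a made-up summary table; the toy data frame and its numbers are hypothetical and stand in for by_country.

```r
library(ggplot2)

# Toy summary table (hypothetical values)
toy <- data.frame(
  country     = c("A", "B", "C"),
  donors_mean = c(12, 20, 16),
  donors_sd   = c(1, 4, 2)
)

p <- ggplot(toy, aes(x = reorder(country, donors_mean), y = donors_mean)) +
  geom_pointrange(aes(ymin = donors_mean - donors_sd,
                      ymax = donors_mean + donors_sd)) +
  labs(x = "", y = "Donor Procurement Rate") +
  coord_flip()

# ggplot_build() exposes the data each layer will draw
ranges <- ggplot_build(p)$data[[1]][, c("y", "ymin", "ymax")]
ranges
```

Each row of ranges holds the point estimate along with the mean minus and plus one standard deviation, which is exactly what the arithmetic inside aes() asked for.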


Figure 5.15: A dot-and-whisker plot, with the range defined by the
standard deviation of the measured variable.


5.3 Plot Text Directly

It can sometimes be useful to plot the labels along with the points
in a scatterplot, or just plot informative labels directly. We can do
this with geom_text().

p <- ggplot(data = by_country, mapping = aes(x = roads_mean,
                                              y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))


The text in figure 5.16 is plotted right on top of the points
because both are positioned using the same x and y mapping. One
way of dealing with this, often the most effective if we are not too
worried about excessive precision in the graph, is to remove the
points by dropping geom_point() from the plot. A second option
is to adjust the position of the text. We can left- or right-justify the
labels using the hjust argument to geom_text(). Setting hjust = 0
will left-justify the label, and hjust = 1 will right-justify it.

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)

You might be tempted to try different values of hjust to fine-tune
the labels in figure 5.17, but this is not a robust approach.
It will often fail because the space is added in proportion to the
length of the label. The result is that longer labels move further
away from their points than you want. There are ways around this,
but they introduce other problems.

Figure 5.16: Plotting labels and text.

Figure 5.17: Plot points and text labels, with a horizontal position
adjustment.

Instead of wrestling any further with geom_text(), we will use
ggrepel. This useful package adds new geoms to ggplot. Just as
ggplot extends the plotting capabilities of R, there are many
small packages that extend the capabilities of ggplot, often by
providing some new type of geom. The ggrepel package provides
geom_text_repel() and geom_label_repel(), two geoms
that can pick out labels much more flexibly than the default
geom_text(). First, make sure the library is installed, then load
it in the usual way:


library(ggrepel) 


We will use geom_text_repel() instead of geom_text(). To
demonstrate some of what geom_text_repel() can do, we will
switch datasets and work with some historical U.S. presidential
election data provided in the socviz library.

elections_historic %>% select(2:7)

## # A tibble: 49 x 6
##     year winner                 win_party ec_pct popular_pct popular_margin
##    <int> <chr>                  <chr>      <dbl>       <dbl>          <dbl>
##  1  1824 John Quincy Adams      D.-R.      0.322       0.309         -0.104
##  2  1828 Andrew Jackson         Dem.       0.682       0.559          0.122
##  3  1832 Andrew Jackson         Dem.       0.766       0.547          0.178
##  4  1836 Martin Van Buren       Dem.       0.578       0.508          0.142
##  5  1840 William Henry Harrison Whig       0.796       0.529         0.0605
##  6  1844 James Polk             Dem.       0.618       0.495         0.0145
##  7  1848 Zachary Taylor         Whig       0.562       0.473         0.0479
##  8  1852 Franklin Pierce        Dem.       0.858       0.508         0.0695
##  9  1856 James Buchanan         Dem.       0.588       0.453          0.122
## 10  1860 Abraham Lincoln        Rep.       0.594       0.396          0.101
## # ... with 39 more rows


p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() +
    geom_text_repel() +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)

Figure 5.18 takes each U.S. presidential election since 1824
(the first year that the size of the popular vote was recorded) and
plots the winner’s share of the popular vote against the winner’s
share of the Electoral College vote. The shares are stored in the
data as proportions (from 0 to 1) rather than percentages, so we
need to adjust the labels of the scales using scale_x_continuous()
and scale_y_continuous(). Seeing as we are interested in particular
presidencies, we also want to label the points. But because
many of the data points are plotted quite close together, we need to
make sure the labels do not overlap, or obscure other points. The
geom_text_repel() function handles the problem very well. This
plot has relatively long titles. We could put them directly in the
code, but to keep things tidier we assign the text to some named
objects instead. Then we use those in the plot formula.

Normally it is not a good idea to label every point on a plot in the
way we do here. A better approach might be to select a few points of
particular interest.

In this plot, what is of interest about any particular point is
the quadrant of the x-y plane each point is in, and how far away
it is from the 50 percent threshold on both the x-axis (with the
popular vote share) and the y-axis (with the Electoral College vote
share). To underscore this point we draw two reference lines at the
50 percent line in each direction. They are drawn at the beginning
of the plotting process so that the points and labels can be
layered on top of them. We use two new geoms, geom_hline()
and geom_vline(), to make the lines. They take yintercept and
xintercept arguments, respectively, and the lines can also be sized
and colored as you please. There is also a geom_abline() geom
that draws straight lines based on a supplied slope and intercept.
This is useful for plotting, for example, 45 degree reference lines
in scatterplots.

Figure 5.18: Text labels with ggrepel.
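As a brief illustration of the three reference-line geoms together, here is a minimal sketch on a made-up data frame. The toy values are hypothetical; only the geoms and their slope, intercept, yintercept, and xintercept arguments come from ggplot itself.

```r
library(ggplot2)

# Toy data (hypothetical): shares stored as proportions
toy <- data.frame(
  x = c(0.40, 0.50, 0.60),
  y = c(0.45, 0.55, 0.70)
)

# Reference lines are added first so the points are drawn on top of them
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_abline(slope = 1, intercept = 0, color = "gray80") +  # 45 degree line
  geom_hline(yintercept = 0.5, color = "gray80") +
  geom_vline(xintercept = 0.5, color = "gray80") +
  geom_point()

length(p$layers)  # one layer per geom
```

Layers are drawn in the order they are added, so putting the lines before geom_point() keeps them in the background.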

The ggrepel package has several other useful geoms and
options to aid with effectively plotting labels along with points.
The performance of its labeling algorithm is consistently very
good. For most purposes it will be a better first choice than
geom_text().


5.4 Label Outliers

Sometimes we want to pick out some points of interest in the data
without labeling every single item. We can still use geom_text()
or geom_text_repel(). We just need to select the points we want
to label. We do this by telling geom_text_repel() to use a different
dataset from the one geom_point() is using. The subset()
function does the work.


p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
    geom_text_repel(data = subset(by_country, gdp_mean > 25000),
                    mapping = aes(label = country))

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
    geom_text_repel(data = subset(by_country,
                                  gdp_mean > 25000 | health_mean < 1500 |
                                  country %in% "Belgium"),
                    mapping = aes(label = country))


In the top part of figure 5.19, we specify a new data argument
to the text geom and use subset() to create a small dataset on the
fly. The subset() function takes the by_country object and selects
only the cases where gdp_mean is over 25,000, with the result that
only those points are labeled in the plot. The criteria we use can be
whatever we like, as long as we can write a logical expression that
defines it. For example, in the lower part of the figure we pick out
cases where gdp_mean is greater than 25,000, or health_mean is less
than 1,500, or the country is Belgium. In all these plots, because we
are using geom_text_repel(), we no longer have to worry about
our earlier problem where the country labels were clipped at the
edge of the plot.
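It can help to run the subset() call on its own before handing it to a geom, to check which rows the logical expression keeps. The following is a minimal base R sketch; the toy data frame and its values are hypothetical and stand in for by_country.

```r
# Toy summary table (hypothetical values)
toy <- data.frame(
  country     = c("A", "B", "Belgium", "D"),
  gdp_mean    = c(26000, 21000, 22500, 24000),
  health_mean = c(2500, 1400, 1958, 2000)
)

# Keep a row if any one of the three conditions is TRUE
keep <- subset(toy, gdp_mean > 25000 | health_mean < 1500 |
                    country %in% "Belgium")
keep$country   # "A" "B" "Belgium"
```

Country D fails all three conditions and is dropped, so it would be drawn as a point but left unlabeled.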

Alternatively, we can pick out specific points by creating a
dummy variable in the data set just for this purpose. For figure 5.20
we add a column to organdata called ind. An observation gets
coded as TRUE if ccode is “Ita” or “Spa,” and if the year is greater
than 1998. We use this new ind variable in two ways in the plotting
code. First, we map it to the color aesthetic in the usual way.
Second, we use it to subset the data that the text geom will label.
Then we suppress the legend that would otherwise appear for the
label and color aesthetics by using the guides() function.

Figure 5.19: Top: Labeling text according to a single criterion.
Bottom: Labeling according to several criteria.

Figure 5.20: Labeling using a dummy variable.

organdata$ind <- organdata$ccode %in% c("Ita", "Spa") &
    organdata$year > 1998

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors, color = ind))
p + geom_point() +
    geom_text_repel(data = subset(organdata, ind),
                    mapping = aes(label = ccode)) +
    guides(label = FALSE, color = FALSE)


5.5 Write and Draw in the Plot Area

Sometimes we want to annotate the figure directly. Maybe we need
to point out something important that is not mapped to a variable.
We use annotate() for this purpose. It isn’t quite a geom, as
it doesn’t accept any variable mappings from our data. Instead, it
can use geoms, temporarily taking advantage of their features in
order to place something on the plot. The most obvious use-case
is putting arbitrary text on the plot (fig. 5.21).

We will tell annotate() to use a text geom. It hands the plotting
duties to geom_text(), which means that we can use all of
that geom’s arguments in the annotate() call. This includes the x,
y, and label arguments, as one would expect, but also things like
size, color, and the hjust and vjust settings that allow text to
be justified. This is particularly useful when our label has several
lines in it. We include extra lines by using the special “newline”
code, \n, which we use instead of a space to force a line-break as
needed.

p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))
p + geom_point() + annotate(geom = "text", x = 91, y = 33,
                            label = "A surprisingly high \n recovery rate.",
                            hjust = 0)

The annotate() function can work with other geoms, too. Use
it to draw rectangles, line segments, and arrows. Just remember to
pass along the right arguments to the geom you use. We can add a
rectangle to this plot (fig. 5.22), for instance, with a second call to
the function.

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))
p + geom_point() +
    annotate(geom = "rect", xmin = 125, xmax = 155,
             ymin = 30, ymax = 35, fill = "red", alpha = 0.2) +
    annotate(geom = "text", x = 157, y = 33,
             label = "A surprisingly high \n recovery rate.", hjust = 0)

Figure 5.21: Arbitrary text with annotate().


5.6 Understanding Scales, Guides, and Themes 

This chapter has gradually extended our ggplot vocabulary in two
ways. First, we introduced some new geom_ functions that allowed
us to draw new kinds of plots. Second, we made use of new functions
controlling some aspects of the appearance of our graph.
We used scale_x_log10(), scale_x_continuous(), and other
scale_ functions to adjust axis labels. We used the guides() function
to remove the legends for a color mapping and a label
mapping. And we also used the theme() function to move the
position of a legend from the side to the top of a figure.

Learning about new geoms extended what we have seen already.
Each geom makes a different type of plot. Different plots require
different mappings in order to work, and so each geom_ function
takes mappings tailored to the kind of graph it draws. You can’t use
geom_point() to make a scatterplot without supplying an x and a y
mapping, for example. Using geom_histogram() only requires you
to supply an x mapping. Similarly, geom_pointrange() requires
ymin and ymax mappings in order to know where to draw the
line ranges it makes. A geom_ function may take optional arguments,
too. When using geom_boxplot() you can specify what
the outliers look like using arguments like outlier.shape and
outlier.color.

Figure 5.22: Using two different geoms with annotate().

The second kind of extension introduced some new functions,
and with them some new concepts. What are the differences
between the scale_ functions, the guides() function, and the
theme() function? When do you know to use one rather than the
other? Why are there so many scale_ functions listed in the online
help, anyway? How can you tell which one you need?

Here is a rough starting point:

• Every aesthetic mapping has a scale. If you want to adjust
  how that scale is marked or graduated, then you use a scale_
  function.

• Many scales come with a legend or key to help the reader interpret
  the graph. These are called guides. You can make adjustments
  to them with the guides() function. Perhaps the most
  common use case is to make the legend disappear, as sometimes
  it is superfluous. Another is to adjust the arrangement of
  the key in legends and color bars.

• Graphs have other features not strictly connected to the logical
  structure of the data being displayed. These include things
  like their background color, the typeface used for labels, or the
  placement of the legend on the graph. To adjust these, use the
  theme() function.
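The three cases in this list can be sketched together in one plot. This is an illustration on a made-up data frame, not code from the chapter; the toy values, the choice of breaks, and the legend title are all assumptions.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  roads  = c(80, 120, 160, 200),
  donors = c(10, 15, 20, 25),
  world  = c("Liberal", "SocDem", "Corporatist", "Liberal")
)

p <- ggplot(toy, aes(x = roads, y = donors, color = world)) +
  geom_point() +
  # scale_: change how the x scale is marked
  scale_x_continuous(breaks = c(100, 150, 200)) +
  # guides(): adjust the legend that accompanies the color scale
  guides(color = guide_legend(title = "Welfare regime")) +
  # theme(): a cosmetic change unrelated to the data
  theme(legend.position = "top")

built <- ggplot_build(p)
nrow(built$data[[1]])  # one row per plotted point
```

Only the scale_ call changes how data values are represented; the guides() and theme() calls leave the mapping untouched and alter how it is presented.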

Consistent with ggplot’s overall approach, adjusting some visible
feature of the graph means first thinking about the relationship
the feature has with the underlying data. Roughly speaking, if the
change you want to make will affect the substantive interpretation
of any particular geom, then most likely you will either be mapping
an aesthetic to a variable using that geom’s aes() function or be
specifying a change via some scale_ function. If the change you
want to make does not affect the interpretation of a given geom_,
then most likely you will either be setting a variable inside the
geom_ function, or making a cosmetic change via the theme()
function.

Scales and guides are closely connected, which can make
things confusing. The guide provides information about the scale,
such as in a legend or color bar. Thus it is possible to make adjustments
to guides from inside the various scale_ functions. More
often it is easier to use the guides() function directly.


p <- ggplot(data = organdata, 

mapping = aes(x = roads, 
y = donors, 
color = world)) 

p + geom_point() 


Figure 5.23 shows a plot with three aesthetic mappings. The
variable roads is mapped to x; donors is mapped to y; and world is
mapped to color. The x and y scales are both continuous, running
smoothly from just under the lowest value of the variable to just
over the highest value. Various labeled tick-marks orient the reader
to the values on each axis. The color mapping also has a scale. The
world measure is an unordered categorical variable, so its scale is
discrete. It takes one of four values, each represented by a different
color.

Along with color, mappings like fill, shape, and size will
have scales that we might want to customize or adjust. We could
have mapped world to shape instead of color. In that case our
four-category variable would have a scale consisting of four different
shapes. Scales for these mappings may have labels, axis
tick-marks at particular positions, or specific colors or shapes. If
we want to adjust them, we use one of the scale_ functions.

Many different kinds of variable can be mapped. More often 
than not, x and y are continuous measures. But they might also 
easily be discrete, as when we mapped country names to the y-axis
in our boxplots and dotplots. An x or y mapping can also be
defined as a transformation onto a log scale, or as a special sort
of number value like a date. Similarly, a color or a fill mapping
can be discrete and unordered, as with our world variable, or discrete
and ordered, as with letter grades in an exam. A color or
fill mapping can also be a continuous quantity, represented as a
gradient running smoothly from a low to a high value. Finally,
both continuous gradients and ordered discrete values might have
some defined neutral midpoint with extremes diverging in both
directions.

Figure 5.23: Every mapped variable has a scale. (Legend: World; Corporatist, Liberal, SocDem, NA.)

scale_<MAPPING>_<KIND>()

Figure 5.24: A schema for naming the scale functions.

Because we have several potential mappings, and each mapping
might be to one of several different scales, we end up with a lot
of individual scale_ functions. Each deals with one combination
of mapping and scale. They are named according to a consistent
logic, shown in figure 5.24. First comes the scale_ name, then
the mapping it applies to, and finally the kind of value the scale
will display. Thus the scale_x_continuous() function controls
x scales for continuous variables; scale_y_discrete() adjusts y
scales for discrete variables; and scale_x_log10() transforms an
x mapping to a log scale. Most of the time, ggplot will guess correctly
what sort of scale is needed for your mapping. Then it will
work out some default features of the scale (such as its labels and
where the tick-marks go). In many cases you will not need to make
any scale adjustments. If x is mapped to a continuous variable, then
adding + scale_x_continuous() to your plot statement with no
further arguments will have no effect. It is already there implicitly.
Adding + scale_x_log10(), on the other hand, will transform
your scale, as now you have replaced the default treatment of a
continuous x variable.

Figure 5.25: Making some scale adjustments.
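
We can check the difference between an implicit scale and a replaced one directly. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in. It relies on layer_data(), a ggplot2 helper that returns the values actually drawn for a layer.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) + geom_point()

# layer_data() returns the values ggplot actually plots for a layer.
x_default  <- layer_data(p)$x
x_explicit <- layer_data(p + scale_x_continuous())$x
x_logged   <- layer_data(p + scale_x_log10())$x

# Adding the default scale explicitly leaves the plotted values unchanged.
all.equal(x_default, x_explicit)

# Replacing it with a log scale transforms the x values before plotting.
all.equal(x_logged, log10(mtcars$wt))
```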

If you want to adjust the labels or tick-marks on a scale, you 
will need to know which mapping it is for and what sort of scale it 
is. Then you supply the arguments to the appropriate scale function.
For example, we can change the x-axis of the previous plot
to a log scale and then also change the position and labels of the 
tick-marks on the y-axis (fig. 5.25). 

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))

p + geom_point() +
    scale_x_log10() +
    scale_y_continuous(breaks = c(5, 15, 25),
                       labels = c("Five", "Fifteen", "Twenty Five"))

The same applies to mappings like color and fill (see 
fig. 5.26). Here the available scale_ functions include ones that 
deal with continuous, diverging, and discrete variables, as well as 
others that we will encounter later when we discuss the use of color 
and color palettes in more detail. When working with a scale that 
produces a legend, we can also use its scale_ function to specify 
the labels in the key. To change the title of the legend, however, we 
use the labs() function, which lets us label all the mappings.

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))

p + geom_point() +
    scale_color_discrete(labels = c("Corporatist", "Liberal",
                                    "Social Democratic", "Unclassified")) +
    labs(x = "Road Deaths", y = "Donor Procurement", color = "Welfare State")

Figure 5.26: Relabeling via a scale function.


If we want to move the legend somewhere else on the 
plot, we are making a purely cosmetic decision and that is the 
job of the theme() function. As we have already seen, adding
+ theme(legend.position = "top") will move the legend as 
instructed. Finally, to make the legend disappear altogether 
(fig. 5.27), we tell ggplot that we do not want a guide for that scale. 
This is generally not good practice, but there can be reasons to do 
it. We already saw an example in figure 4.9. 


p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))

p + geom_point() +
    labs(x = "Road Deaths", y = "Donor Procurement") +
    guides(color = FALSE)
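
A note for readers on newer software: as far as we know, ggplot2 releases from 3.3.4 onward report guides(color = FALSE) as deprecated and document "none" as the replacement. The sketch below shows that form, again using the built-in mtcars data as a stand-in for organdata.

```r
library(ggplot2)

p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl)))

# guides(color = "none") suppresses the legend for the color scale only.
p_noguide <- p + geom_point() +
    labs(x = "Weight", y = "Miles per Gallon") +
    guides(color = "none")

# An alternative that goes through the theme and hides every legend at once.
p_nolegend <- p + geom_point() + theme(legend.position = "none")
```

The first version is closer to the logic described here, because it says which scale should go without a guide. The second is a purely cosmetic instruction.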

We will look more closely at scale_ and theme() functions in
chapter 8, when we discuss how to polish plots that we are ready
to display or publish. Until then, we will use scale_ functions
fairly regularly to make small adjustments to the labels and axes
of our graphs. And we will occasionally use the theme() function
to make some cosmetic adjustments. So you do not need to
worry about additional details of how they work until later on.
But at this point it is worth knowing what scale_ functions are
for, and the logic behind their naming scheme. Understanding
the scale_<mapping>_<kind>() rule makes it easier to see what
is going on when one of these functions is called to make an
adjustment to a plot.

Figure 5.27: Removing the guide to a scale.

5.7 Where to Go Next 

We covered several new functions and data aggregation techniques 
in this chapter. You should practice working with them. 

• The subset() function is very useful when used in conjunction
with a series of layered geoms. Go back to your code for
the presidential elections plot (fig. 5.18) and redo it so that it
shows all the data points but only labels elections since 1992.
You might need to look again at the elections_historic data
to see what variables are available to you. You can also experiment
with subsetting by political party, or changing the colors
of the points to reflect the winning party.

• Use geom_point() and reorder() to make a Cleveland dotplot
of all presidential elections, ordered by share of the popular
vote.
• Try using annotate() to add a rectangle that lightly colors the
entire upper left quadrant of figure 5.18.

• The main action verbs in the dplyr library are group_by(),
filter(), select(), summarize(), and mutate(). Practice
with them by revisiting the gapminder data to see if you can
reproduce a pair of graphs from chapter 1, shown here again in
figure 5.28. You will need to filter some rows, group the data by
continent, and calculate the mean life expectancy by continent
before beginning the plotting process.

Figure 5.28: Two figures from chapter 1. (x-axis: Life expectancy in years, 2007.)

• Get comfortable with grouping, mutating, and summarizing
data in pipelines. This will become a routine task as you work
with your data. There are many ways that tables can be aggregated
and transformed. Remember, group_by() groups your
data from left to right, with the rightmost or innermost group
being the level calculations will be done at; mutate() adds
a column at the current level of grouping; and summarize()
aggregates to the next level up. Try creating some grouped
objects from the GSS data, calculating frequencies as you
learned in this chapter, and then check to see if the totals are
what you expect. For example, start by grouping degree by
race, like this:


gss_sm %>% group_by(race, degree) %>% summarize(N = n()) %>%
    mutate(pct = round(N/sum(N) * 100, 0))


## # A tibble: 18 x 4
## # Groups:   race [3]
##    race  degree             N   pct
##    <fct> <fct>          <int> <dbl>
##  1 White Lt High School   197    9.
##  2 White High School     1057   50.
##  3 White Junior College   166    8.
##  4 White Bachelor         426   20.
##  5 White Graduate         250   12.
##  6 White <NA>               4    0.
##  7 Black Lt High School    60   12.
##  8 Black High School      292   60.
##  9 Black Junior College    33    7.
## 10 Black Bachelor          71   14.
## 11 Black Graduate          31    6.
## 12 Black <NA>               3    1.
## 13 Other Lt High School    71   26.
## 14 Other High School      112   40.
## 15 Other Junior College    17    6.
## 16 Other Bachelor          39   14.
## 17 Other Graduate          37   13.
## 18 Other <NA>               1    0.

• This code is similar to what you saw earlier but more compact.
(We calculate the pct values directly.) Check that the
results are as you expect by grouping by race and summing
the percentages. Try doing the same exercise grouping by sex
or region.

• Try summary calculations with functions other than sum. Can
you calculate the mean and median number of children by
degree? (Hint: the childs variable in gss_sm has children as a
numeric value.)

• dplyr has a large number of helper functions that let you
summarize data in many different ways. The vignette on window
functions included with the dplyr documentation is a
good place to begin learning about these. You should also
look at chapter 3 of Wickham & Grolemund (2016) for more
information on transforming data with dplyr.

• Experiment with the gapminder data to practice some of the
new geoms we have learned. Try examining population or life
expectancy over time using a series of boxplots. (Hint: you may
need to use the group aesthetic in the aes() call.) Can you facet
this boxplot by continent? Is anything different if you create a
tibble from gapminder that explicitly groups the data by year
and continent first, and then create your plots with that?

• Read the help page for geom_boxplot() and take a look at the 
notch and varwidth options. Try them out to see how they 
change the look of the plot. 

• As an alternative to geom_boxplot(), try geom_violin() for a
similar plot but with a mirrored density distribution instead of 
a box and whiskers. 

• geom_pointrange() is one of a family of related geoms that 
produce different kinds of error bars and ranges, depending
on your specific needs. They include geom_linerange(),
geom_crossbar(), and geom_errorbar(). Try them out using 
gapminder or organdata to see how they differ. 
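
The check suggested in the grouping exercise above, summing the percentages within each outer group, can be sketched on a small scale. The example assumes the dplyr package (version 1.0.0 or later, for the .groups argument) and uses the built-in mtcars data in place of gss_sm, with cyl and gear standing in for race and degree.

```r
library(dplyr)

# Count cars by cylinders and gears, then compute percentages within cylinders.
# After summarize(), the innermost grouping (gear) is dropped, so sum(N)
# is calculated within each level of cyl.
by_cyl_gear <- mtcars %>%
    group_by(cyl, gear) %>%
    summarize(N = n(), .groups = "drop_last") %>%
    mutate(pct = round(N / sum(N) * 100, 0))

# Summing the percentages within each cyl group should give roughly 100.
# Rounding can push a total a point or so either way.
check <- by_cyl_gear %>%
    group_by(cyl) %>%
    summarize(total = sum(pct))

check
```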




6 Work with Models 


Data visualization is about more than generating figures that display
the raw numbers from a table of data. Right from the beginning,
it involves summarizing or transforming parts of the data
and then plotting the results. Statistical models are a central part of 
that process. In this chapter, we will begin by looking briefly at how 
ggplot can use various modeling techniques directly within geoms. 
Then we will see how to use the broom and margins libraries to 
tidily extract and plot estimates from models that we fit ourselves. 

p <- ggplot(data = gapminder,
            mapping = aes(x = log(gdpPercap), y = lifeExp))

p + geom_point(alpha = 0.1) +
    geom_smooth(color = "tomato", fill = "tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill = "steelblue", method = "lm")

p + geom_point(alpha = 0.1) +
    geom_smooth(color = "tomato", method = "lm", size = 1.2,
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha = 0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))


Histograms, density plots, boxplots, and other geoms compute 
either single numbers or new variables before plotting them. As we 
saw in section 4.4, these calculations are done by stat_ functions, 
each of which works hand in hand with its default geom_ function,
and vice versa. Moreover, from the smoothing lines we drew
in almost the very first plots we made, we have seen that stat_
functions can do a fair amount of calculation and even model estimation
directly. The geom_smooth() function can take a range of
method arguments to fit LOESS, OLS, and robust regression lines,
among others.

Both the geom_smooth() and geom_quantile() functions can
also be instructed to use different formulas to produce their fits. In
the top panel of figure 6.1, we access the MASS library’s rlm function
to fit a robust regression line. In the second panel, the bs function
is invoked directly from the splines library in the same way, to
fit a polynomial curve to the data. This is the same approach to
directly accessing functions without loading a whole library that
we have already used several times when using functions from the
scales package. The geom_quantile() function, meanwhile, is
like a specialized version of geom_smooth() that can fit quantile
regression lines using a variety of methods. The quantiles argument
takes a vector specifying the quantiles at which to fit the lines.


6.1 Show Several Fits at Once, with a Legend 

As we just saw in the first panel of figure 6.1, where we plotted 
both an OLS and a robust regression line, we can look at several 
fits at once on the same plot by layering on new smoothers with 
geom_smooth(). As long as we set the color and fill aesthetics to 
different values for each fit, we can easily distinguish them visually.
However, ggplot will not draw a legend that guides us about
which fit is which. This is because the smoothers are not logically 
connected to one another. They exist as separate layers. What if we 
are comparing several different fits and want a legend describing 
them? 

As it turns out, geom_smooth() can do this via the slightly 
unusual route of mapping the color and fill aesthetics to 
a string describing the model we are fitting and then using 
scale_color_manual() and scale_fill_manual() to create the 
legend (fig. 6.2). First we use brewer.pal() from the RColorBrewer
library to extract three qualitatively different colors from
a larger palette. The colors are represented as hex values. As before,
we use the :: convention to use the function without loading the
whole package:

model_colors <- RColorBrewer::brewer.pal(3, "Set1")

model_colors 


## [1] "#E41A1C" "#377EB8" "#4DAF4A" 





Figure 6.1: From top to bottom: an OLS vs robust regression comparison; a polynomial fit; and quantile regression.

Then we create a plot with three different smoothers, mapping
the color and fill within the aes() function as the name of the
smoother:

p0 <- ggplot(data = gapminder,
             mapping = aes(x = log(gdpPercap), y = lifeExp))

p1 <- p0 + geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
    geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = 3),
                aes(color = "Cubic Spline", fill = "Cubic Spline")) +
    geom_smooth(method = "loess",
                aes(color = "LOESS", fill = "LOESS"))

p1 + scale_color_manual(name = "Models", values = model_colors) +
    scale_fill_manual(name = "Models", values = model_colors) +
    theme(legend.position = "top")

Figure 6.2: Fitting smoothers with a legend. (Legend: Models; Cubic Spline, LOESS, OLS.)


In a way we have cheated a little here to make the plot work.
Until now, we have always mapped aesthetics to the names of variables,
not to strings like “OLS” or “Cubic Spline.” In chapter 3,
when we discussed mapping versus setting aesthetics, we saw what
happened when we tried to change the color of the points in a scatterplot
by setting them to “purple” inside the aes() function. The
result was that the points turned red instead, as ggplot in effect
created a new variable and labeled it with the word “purple.” We
learned there that the aes() function was for mapping variables
to aesthetics.

Here we take advantage of that behavior, creating a new single-value
variable for the name of each of our models. Ggplot will properly
construct the relevant guide if we call scale_color_manual()
and scale_fill_manual(). The result is a single plot containing
not just our three smoothers but also an appropriate legend to
guide the reader.

Remember that we have to call two scale functions because we have two mappings, color and fill.

These model-fitting features make ggplot very useful for 
exploratory work and make it straightforward to generate and 
compare model-based trends and other summaries as part of 
the process of descriptive data visualization. The various stat_ 
functions are a flexible way to add summary estimates of various
kinds to plots. But we will also want more than this, including
presenting results from models we fit ourselves.


6.2 Look Inside Model Objects 

Covering the details of fitting statistical models in R is beyond
the scope of this book. For a comprehensive, modern introduction
to that topic you should work your way through Gelman &
Hill (2018). Harrell (2016) is also good on the many practical connections
between modeling and graphing data. Similarly, Gelman
(2004) provides a detailed discussion of the use of graphics as a
tool in model checking and validation. Here we will discuss some
ways to take the models that you fit and extract information that
is easy to work with in ggplot. Our goal, as always, is to get from
however the object is stored to a tidy table of numbers that we can
plot. Most classes of statistical model in R will contain the information
we need or will have a special set of functions, or methods,
designed to extract it.

We can start by learning a little more about how the output 
of models is stored in R. Remember, we are always working with 
objects, and objects have an internal structure consisting of named 
pieces. Sometimes these are single numbers, sometimes vectors, 
and sometimes lists of things like vectors, matrices, or formulas. 

We have been working extensively with tibbles and data 
frames. These store tables of data with named columns, perhaps 
consisting of different classes of variable, such as integers, characters,
dates, or factors. Model objects are a little more complicated
again. 


gapminder 


## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows


Remember, we can use the str() function to learn more about
the internal structure of any object. For example, we can get some
information on what class (or classes) of object gapminder is,
how large it is, and what components it has. The output from
str(gapminder) is somewhat dense:

## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3
##    ...
## $ year     : int 1952 1957 ...
## $ lifeExp  : num 28.8 ...
## $ pop      : int 8425333 9240934 ...
## $ gdpPercap: num 779 ...

There is a lot of information here about the object as a whole 
and each variable in it. In the same way, statistical models in R 
have an internal structure. But because models are more complex 
entities than data tables, their structure is correspondingly more 
complicated. There are more pieces of information, and more 
kinds of information, that we might want to use. All this information
is generally stored in or computable from parts of a model
object. 

We can create a linear model, a standard OLS regression, using 
the gapminder data. This dataset has a country-year structure that 
makes an OLS specification like this the wrong one to use. But 
never mind that for now. We use the lm() function to run the 
model and store it in an object called out: 

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)


The first argument is the formula for the model. lifeExp is the 
dependent variable and the tilde operator is used to designate the 
left and right sides of a model (including in cases, as we saw with
facet_wrap(), where the model just has a right side).



Let’s look at the results by asking R to print a summary of the 
model. 

summary(out) 


##
## Call:
## lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -49.16  -4.49   0.30   5.11  25.17
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       4.78e+01   3.40e-01  140.82   <2e-16 ***
## gdpPercap         4.50e-04   2.35e-05   19.16   <2e-16 ***
## pop               6.57e-09   1.98e-09    3.33    9e-04 ***
## continentAmericas 1.35e+01   6.00e-01   22.46   <2e-16 ***
## continentAsia     8.19e+00   5.71e-01   14.34   <2e-16 ***
## continentEurope   1.75e+01   6.25e-01   27.97   <2e-16 ***
## continentOceania  1.81e+01   1.78e+00   10.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.37 on 1697 degrees of freedom
## Multiple R-squared:  0.582,  Adjusted R-squared:  0.581
## F-statistic:  394 on 6 and 1697 DF,  p-value: <2e-16

When we use the summary() function on out, we are not
getting a simple feed of what’s in the model object. Instead,
like any function, summary() takes its input, performs some
actions, and produces output. In this case, what is printed
to the console is partly information that is stored inside the
model object, and partly information that the summary() function
has calculated and formatted for display on the screen. Behind
the scenes, summary() gets help from other functions. Objects
of different classes have default methods associated with them,
so that when the generic summary() function is applied to a
linear model object, the function knows to pass the work on to
a more specialized function that does a bunch of calculations
and formatting appropriate to a linear model object. We use the
same generic summary() function on data frames, as in
summary(gapminder), but in that case a different default method is
applied.

Figure 6.3: Schematic view of a linear model object. (Elements shown under out: coefficients, residuals, effects, rank, qr [containing qr, pivot, qraux, tol, rank], df.residual, contrasts, xlevels, call, terms, model.frame.)

Try out$df.residual at the console.

Try out$model, but be prepared for a lot of stuff to be printed at the console.
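
This hand-off from a generic function to a class-specific method can be seen at the console. The sketch below uses only base R. It fits a small model to the built-in mtcars data as a stand-in for out; the name fit and the choice of variables are ours.

```r
# Fit a small linear model to R's built-in mtcars data (a stand-in for `out`).
fit <- lm(mpg ~ wt + hp, data = mtcars)

class(fit)              # the class that determines which method is used
class(summary(fit))     # summary() hands the work to summary.lm()
class(summary(mtcars))  # for a data frame, summary.data.frame() is used instead

# The specialized methods are ordinary functions that we can look up by name.
is.function(getS3method("summary", "lm"))
is.function(getS3method("summary", "data.frame"))
```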

The output from summary() gives a precis of the model, but we
can’t really do any further analysis with it directly. For example,
what if we want to plot something from the model? The information
necessary to make plots is inside the object, but it is not
obvious how to use it.

If we take a look at the structure of the model object with
str(out) we will find that there is a lot of information in there.
Like most complex objects in R, out is organized as a list of components
or elements. Several of these elements are themselves lists.
Figure 6.3 gives you a schematic view of the contents of a linear
model object. In this list of items, some elements are single
values, some are data frames, and some are additional lists of simpler
items. Again, remember our earlier discussion where we said
objects could be thought of as being organized like a filing system:
cabinets contain drawers, and a drawer may contain pages
of information, whole documents, or groups of folders with more
documents inside. As an alternative analogy, and sticking with the
image of a list, you can think of a master to-do list for a project,
where the top-level headings lead to additional lists of tasks of
different kinds.

The out object created by lm contains several different named 
elements. Some, like the residual degrees of freedom in the model, 
are just a single number. Others are much larger entities, such as 
the data frame used to fit the model, which is retained by default. 
Other elements have been computed by R and then stored, such 
as the coefficients of the model and other quantities. You can try 
out$coefficients, out$residuals, and out$fitted.values, for
instance. Others are lists themselves (like qr). So you can see
that the summary() function is selecting and printing only a small
amount of core information, in comparison to what is stored in the 
model object. 
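
To get a feel for these named elements, we can pull a few of them out one at a time. This is a minimal base R sketch, with a model fit to the built-in mtcars data standing in for out.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

names(fit)            # the named elements of the model object
fit$df.residual       # a single number
fit$coefficients      # a named numeric vector
head(fit$residuals)   # one residual per observation
class(fit$model)      # the data frame used to fit the model
names(fit$qr)         # qr is itself a list with its own elements
```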

Just like the tables of data we saw in section 6.1, the output
of summary() is presented in a way that is compact and efficient in
terms of getting information across but also untidy when considered
from the point of view of further manipulation. There is a
table of coefficients, but the variable names are in the rows. The
column names are awkward, and some information (e.g., at the
bottom of the output) has been calculated and printed out but is
not stored in the model object.
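
Both complaints can be seen, and worked around by hand, in a few lines of base R. The sketch below again uses a model fit to the built-in mtcars data; the column names in coef_df are our own choice, not a standard.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# The coefficient table is a matrix with the term names stored as row names.
coef_matrix <- coef(fit_summary)

# Move the row names into a column and give the columns simpler names.
coef_df <- data.frame(term = rownames(coef_matrix),
                      estimate = coef_matrix[, "Estimate"],
                      std_error = coef_matrix[, "Std. Error"],
                      statistic = coef_matrix[, "t value"],
                      p_value = coef_matrix[, "Pr(>|t|)"],
                      row.names = NULL)
coef_df

# R-squared is computed by summary(); it is not an element of the lm object.
"r.squared" %in% names(fit)
"r.squared" %in% names(fit_summary)
```

Doing this by hand for every model quickly becomes tedious, which is why a tidy, automated route is worth having.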


6.3 Get Model-Based Graphics Right 

Figures based on statistical models face all the ordinary challenges
of effective data visualization and then some. This is because
model results usually carry a considerable extra burden of interpretation
and necessary background knowledge. The more complex
the model, the trickier it becomes to convey this information
effectively, and the easier it becomes to lead one’s audience or oneself
into error. Within the social sciences, our ability to clearly and
honestly present model-based graphics has greatly improved over
the past ten or fifteen years. Over the same period, it has become
clearer that some kinds of models are quite tricky to understand,
even ones that had previously been seen as straightforward
elements of the modeling toolkit (Ai & Norton 2003; Brambor,
Clark, & Golder 2006).

Plotting model estimates is closely connected to properly estimating
models in the first place. This means there is no substitute
for learning the statistics. You should not use graphical methods 
as a substitute for understanding the model used to produce them. 
While this book cannot teach you that material, we can make a 
few general points about what good model-based graphics look 
like, and work through some examples of how ggplot and some 
additional libraries can make it easier to get good results. 


Present your findings in substantive terms 

Useful model-based plots show results in ways that are substantively
meaningful and directly interpretable with respect to the
questions the analysis is trying to answer. This means showing
results in a context where other variables in the analysis are held
at sensible values, such as their means or medians. With continuous
variables, it can often be useful to generate predicted values
that cover some substantively meaningful move across the distribution,
such as from the 25th to the 75th percentile, rather than
a single-unit increment in the variable of interest. For unordered
categorical variables, predicted values might be presented with
respect to the modal category in the data, or for a particular category
of theoretical interest. Presenting substantively interpretable
findings often also means using (and sometimes converting to)
a scale that readers can easily understand. If your model reports
results in log-odds, for example, converting the estimates to predicted
probabilities will make it easier to interpret. All this advice
is quite general. Each of these points applies equally well to the presentation
of summary results in a table rather than a graph. There
is nothing distinctively graphical about putting the focus on the
substantive meaning of your findings.
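
As a small illustration of the log-odds point, here is a base R sketch that converts predictions from a logistic regression to probabilities. The model (transmission type by weight in the built-in mtcars data) and the three new weights are made up for the example.

```r
# A logistic regression on built-in data: transmission type (am) by weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Predictions on the model's own scale (log-odds) ...
log_odds <- predict(logit_fit, newdata = new_cars, type = "link")

# ... and the same predictions converted to probabilities.
probs <- plogis(log_odds)

# predict() can also return probabilities directly.
all.equal(probs, predict(logit_fit, newdata = new_cars, type = "response"))
```

A reader can make sense of a predicted probability at a glance. A log-odds value usually needs this conversion before it means much to anyone.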


Show your degree of confidence 

Much the same applies to presenting the degree of uncertainty or
confidence you have in your results. Model estimates come with
various measures of precision, confidence, credence, or significance.
Presenting and interpreting these measures is notoriously
prone to misinterpretation, or overinterpretation, as researchers
and audiences both demand more from things like confidence
intervals and p-values than these statistics can deliver. At a minimum,
having decided on an appropriate measure of model fit or
the right assessment of confidence, you should show their range
when you present your results. A family of ggplot geoms allows you
to show a range or interval defined by position on the x-axis and
then a ymin and ymax range on the y-axis. These geoms include
geom_pointrange() and geom_errorbar(), which we will see in
action shortly. A related geom, geom_ribbon(), uses the same arguments
to draw filled areas and is useful for plotting ranges of y-axis
values along some continuously varying x-axis.
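
A minimal sketch of these range geoms follows, assuming ggplot2 is installed. The estimates are group means from the built-in mtcars data with a rough normal-approximation interval, computed here only so that there is something to draw.

```r
library(ggplot2)

# Mean and a rough 95 percent interval for mpg within each cylinder group.
groups <- split(mtcars$mpg, mtcars$cyl)
est <- data.frame(cyl = names(groups),
                  mean = sapply(groups, mean),
                  se = sapply(groups, function(v) sd(v) / sqrt(length(v))))
est$lower <- est$mean - 1.96 * est$se
est$upper <- est$mean + 1.96 * est$se

p <- ggplot(data = est,
            mapping = aes(x = cyl, y = mean, ymin = lower, ymax = upper))

# A point estimate with a vertical line for the range.
p + geom_pointrange()

# The same range drawn as error bars around a point.
p + geom_point() + geom_errorbar(width = 0.2)
```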


Show your data when you can 

Plotting the results from a multivariate model generally means
one of two things. First, we can show what is in effect a table
of coefficients with associated measures of confidence, perhaps
organizing the coefficients into meaningful groups, or by the size
of the predicted association, or both. Second, we can show the
predicted values of some variables (rather than just a model’s coefficients)
across some range of interest. The latter approach lets us
show the original data points if we wish. The way ggplot builds
graphics layer by layer allows us to easily combine model estimates
(e.g., a regression line and an associated range) and the underlying
data. In effect these are manually constructed versions of the
automatically generated plots that we have been producing with
geom_smooth() since the beginning of this book.
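
To make the idea of a manually constructed smoother concrete, here is a sketch that builds one from its parts. It assumes ggplot2 is installed and uses the built-in mtcars data. It also leans on predict(), which is discussed in the next section, and the names wt_fit, new_wt, and pred are ours.

```r
library(ggplot2)

wt_fit <- lm(mpg ~ wt, data = mtcars)

# Predicted values and a 95 percent confidence interval over the range of wt.
new_wt <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
pred <- cbind(new_wt, predict(wt_fit, newdata = new_wt, interval = "confidence"))

# Layer the interval, the fitted line, and the original data points.
p <- ggplot() +
    geom_ribbon(data = pred, mapping = aes(x = wt, ymin = lwr, ymax = upr),
                alpha = 0.2) +
    geom_line(data = pred, mapping = aes(x = wt, y = fit)) +
    geom_point(data = mtcars, mapping = aes(x = wt, y = mpg))
p
```

The result should look much like geom_smooth(method = "lm") drawn over a scatterplot, except that every layer now comes from numbers we computed ourselves.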


6.4 Generate Predictions to Graph 

Having fitted a model, then, we might want to get a picture of
the estimates it produces over the range of some particular variable,
holding other covariates constant at some sensible values.
The predict() function is a generic way of using model objects to
produce this kind of prediction. In R, “generic” functions take their
inputs and pass them along to more specific functions behind the
scenes, ones that are suited to working with the particular kind of
model object we have. The details of getting predicted values from
an OLS model, for instance, will be somewhat different from getting
predictions out of a logistic regression. But in each case we can
use the same predict() function, taking care to check the documentation
to see what form the results are returned in for the kind
of model we are working with. Many of the most commonly used
functions in R are generic in this way. The summary() function,
for example, works on objects of many different classes, from
vectors to data frames and statistical models, producing appropriate
output in each case by way of a class-specific function in the
background.

For predict() to calculate the new values for us, it needs some
new data to apply the fitted model to. We will generate a new data
frame whose columns have the same names as the variables in the
model’s original data, but where the rows have new values. A very
useful function called expand.grid() will help us do this. We will
give it a list of variables, specifying the range of values we want
each variable to take. Then expand.grid() will multiply out the full
range of values for all combinations of the values we give it, thus
creating a new data frame with the new data we need. (The function
calculates the Cartesian product of the variables given to it.)
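A tiny illustration of that Cartesian product may help. The variable names and values here are made up for the example.

```r
# Every combination of three x values and two groups
grid <- expand.grid(x = c(1, 2, 3), group = c("a", "b"))

nrow(grid)  # 3 * 2 = 6 rows
grid
```

Each value of x appears once for every value of group, so the number of rows is the product of the lengths of the inputs.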

In the following bit of code, we use min() and max() to
get the minimum and maximum values for per capita GDP and
then create a vector with one hundred evenly spaced elements
between the minimum and the maximum. We hold population
constant at its median, and we let continent take all its five available
values.


min_gdp <- min(gapminder$gdpPercap)
max_gdp <- max(gapminder$gdpPercap)
med_pop <- median(gapminder$pop)

pred_df <- expand.grid(gdpPercap = (seq(from = min_gdp, to = max_gdp,
    length.out = 100)), pop = med_pop, continent = c("Africa",
    "Americas", "Asia", "Europe", "Oceania"))

dim(pred_df) 


## [1] 500 3 


head(pred_df) 


##   gdpPercap     pop continent
## 1   241.166 7023596    Africa
## 2  1385.428 7023596    Africa
## 3  2529.690 7023596    Africa
## 4  3673.953 7023596    Africa
## 5  4818.215 7023596    Africa
## 6  5962.477 7023596    Africa


Now we can use predict(). If we give the function our new
data and model, without any further argument it will calculate
the fitted values for every row in the data frame. If we specify
interval = "predict" as an argument, it will calculate 95
percent prediction intervals in addition to the point estimate.

pred_out <- predict(object = out, newdata = pred_df, interval = "predict")
head(pred_out) 


## fit lwr upr 
## 1 47.9686 31.5477 64.3895 
## 2 48.4830 32.0623 64.9037 
## 3 48.9973 32.5767 65.4180 
## 4 49.5117 33.0909 65.9325 
## 5 50.0260 33.6050 66.4471 
## 6 50.5404 34.1189 66.9619

Because we know that, by construction, the cases in pred_df
and pred_out correspond row for row, we can bind the two data
frames together by column. This method of joining or merging
tables is definitely not recommended in general, when you cannot
guarantee that the rows of the two tables line up.
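To see why binding by column is risky outside a case like this one, here is a small sketch with two invented tables that describe the same three cases in different row orders. The base R merge() function is used as one example of a key-based join; the table and column names are hypothetical.

```r
left <- data.frame(id = c(1, 2, 3), x = c(10, 20, 30))
# The same three cases, but stored in a different order
right <- data.frame(id = c(3, 1, 2), y = c("c", "a", "b"))

# Binding by column pairs rows by position, so the labels are scrambled
bound <- cbind(left, y = right$y)

# Joining on the shared key pairs each case with its own label
joined <- merge(left, right, by = "id")

bound
joined
```

In our prediction example the row order is guaranteed because pred_out was computed directly from pred_df, which is the only reason cbind() is safe there.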


pred_df <- cbind(pred_df, pred_out)
head(pred_df) 


##   gdpPercap     pop continent  fit  lwr  upr
## 1       241 7023596    Africa 48.0 31.5 64.4
## 2      1385 7023596    Africa 48.5 32.1 64.9
## 3      2530 7023596    Africa 49.0 32.6 65.4
## 4      3674 7023596    Africa 49.5 33.1 65.9
## 5      4818 7023596    Africa 50.0 33.6 66.4
## 6      5962 7023596    Africa 50.5 34.1 67.0


The end result is a tidy data frame, containing the predicted
values from the model for the range of values we specified. Now
we can plot the results. Because we produced a full range of
predicted values, we can decide whether to use all of them. Here we
further subset the predictions to just those for Europe and Africa
(fig. 6.4).


Figure 6.4: OLS predictions. (Legend: continent, showing Africa and Europe; x-axis: gdpPercap.)


p <- ggplot(data = subset(pred_df, continent %in% c("Europe", "Africa")),
            aes(x = gdpPercap,
                y = fit, ymin = lwr, ymax = upr,
                color = continent,
                fill = continent,
                group = continent))

p + geom_point(data = subset(gapminder,
                             continent %in% c("Europe", "Africa")),
               aes(x = gdpPercap, y = lifeExp,
                   color = continent),
               alpha = 0.5,
               inherit.aes = FALSE) +
    geom_line() +
    geom_ribbon(alpha = 0.2, color = FALSE) +
    scale_x_log10(labels = scales::dollar)

We use a new geom here to draw the area covered by the prediction
intervals: geom_ribbon(). It takes an x argument like a line but a
ymin and ymax argument as specified in the ggplot() aesthetic
mapping. These define the lower and upper limits of the
prediction interval.

In practice, you may not use predict() directly all that often.
Instead, you might write code using additional packages that
encapsulate the process of producing predictions and plots from
models. These are especially useful when your model is a little
more complex and the interpretation of coefficients becomes trickier.
This happens, for instance, when you have a binary outcome
variable and need to convert the results of a logistic regression into
predicted probabilities, or when you have interaction terms among
your predictors. We will discuss some of these helper packages in
the next few sections. However, bear in mind that predict() and
its ability to work safely with different classes of model underpins
most of these helpers. So it’s useful to see it in action firsthand in
order to understand what it is doing.


6.5 Tidy Model Objects with Broom 

The predict method is very useful, but there are a lot of other
things we might want to do with our model output. We will use
David Robinson’s broom package to help us out. It is a library of
functions that help us get from the model results that R generates
to numbers that we can plot. It will take model objects and turn
pieces of them into data frames that you can use easily with ggplot.


library(broom) 


Broom takes ggplot’s approach to tidy data and extends it
to the model objects that R produces. Its methods can tidily
extract three kinds of information. First, we can see component-level
information about aspects of the model itself, such as
coefficients and t-statistics. Second, we can obtain observation-level
information about the model’s connection to the underlying data.
This includes the fitted values and residuals for each observation
in the data. And finally we can get model-level information that
summarizes the fit as a whole, such as an F-statistic, the model
deviance, or the r-squared. There is a broom function for each of
these tasks.

Get component-level statistics with tidy()


The tidy() function takes a model object and returns a data frame
of component-level information. We can work with this to make
plots in a familiar way, and much more easily than fishing inside
the model object to extract the various terms. Here is an example,
using the default results as just returned. For a more convenient
display of the results, we will pipe the object we create with tidy()
through a function that rounds the numeric columns of the data
frame to two decimal places. This doesn’t change anything about
the object itself, of course.


out_comp <- tidy(out) 
out_comp %>% round_df()


##                term estimate std.error statistic p.value
## 1       (Intercept)    47.81      0.34    140.82       0
## 2         gdpPercap     0.00      0.00     19.16       0
## 3               pop     0.00      0.00      3.33       0
## 4 continentAmericas    13.48      0.60     22.46       0
## 5     continentAsia     8.19      0.57     14.34       0
## 6   continentEurope    17.47      0.62     27.97       0
## 7  continentOceania    18.08      1.78     10.15       0


We are now able to treat this data frame just like all the other 
data that we have seen so far, and use it to make a plot (fig. 6.5). 


p <- ggplot(out_comp, mapping = aes(x = term, y = estimate))
p + geom_point() + coord_flip() 


Figure 6.5: Basic plot of OLS estimates. (Model terms on the vertical axis; Estimate on the horizontal axis.)


We can extend and clean up this plot in a variety of ways. For
example, we can tell tidy() to calculate confidence intervals for
the estimates, using R’s confint() function.

out_conf <- tidy(out, conf.int = TRUE) 
out_conf %>% round_df()

##                term estimate std.error statistic p.value conf.low conf.high
## 1       (Intercept)    47.81      0.34    140.82       0    47.15     48.48
## 2         gdpPercap     0.00      0.00     19.16       0     0.00      0.00
## 3               pop     0.00      0.00      3.33       0     0.00      0.00
## 4 continentAmericas    13.48      0.60     22.46       0    12.30     14.65
## 5     continentAsia     8.19      0.57     14.34       0     7.07      9.31
## 6   continentEurope    17.47      0.62     27.97       0    16.25     18.70
## 7  continentOceania    18.08      1.78     10.15       0    14.59     21.58


The convenience “not in” operator %nin% is available via the
socviz library. It does the opposite of %in% and selects only the
items in a first vector of characters that are not in the second. We’ll
use it to drop the intercept term from the table. We also want to
do something about the labels. When fitting a model with categorical
variables, R will create coefficient names based on the variable
name and the category name, like continentAmericas. Normally
we like to clean these up before plotting. Most commonly, we
just want to strip away the variable name at the beginning of the
coefficient label. For this we can use prefix_strip(), a convenience
function in the socviz library. We tell it which prefixes to
drop, using it to create a new column variable in out_conf that
corresponds to the term column but has nicer labels.

out_conf <- subset(out_conf, term %nin% "(Intercept)")
out_conf$nicelabs <- prefix_strip(out_conf$term, "continent") 
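If socviz is not loaded, roughly the same two steps can be sketched in base R. This is only an approximation under stated assumptions: the terms vector below is a hand-typed excerpt of the term names from the table above, and sub() does nothing more than strip the prefix, so it may not reproduce any other tidying that prefix_strip() performs.

```r
# A hand-typed excerpt of term names, for illustration only
terms <- c("(Intercept)", "gdpPercap", "pop",
           "continentAmericas", "continentAsia")

# "Not in": negate %in% to keep everything except the intercept
keep <- terms[!(terms %in% "(Intercept)")]

# Strip a leading "continent" from each remaining label
nice <- sub("^continent", "", keep)
nice
```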

Now we can use geom_pointrange() to make a figure (fig. 6.6)
that displays some information about our confidence in the variable
estimates, as opposed to just the coefficients. As with the
boxplots earlier, we use reorder() to sort the names of the model’s
terms by the estimate variable, thus arranging our plot of effects
from largest to smallest in magnitude.

p <- ggplot(out_conf, mapping = aes(x = reorder(nicelabs, estimate), 
y = estimate, ymin = conf.low, ymax = conf.high)) 
p + geom_pointrange() + coord_flip() + labs(x = "", y = "OLS Estimate")
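What reorder() does to the labels can be checked on its own. The sketch below uses three of the continent estimates from the table above, typed in by hand; the object names are made up for the example.

```r
labs_in <- c("Asia", "Europe", "Americas")
est <- c(8.19, 17.47, 13.48)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_labs <- reorder(labs_in, est)
levels(ordered_labs)
```

The levels run from the smallest estimate to the largest. Because ggplot draws the first level nearest the origin, flipping the coordinates puts the largest estimate at the top of the plot.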


Figure 6.6: A nicer plot of OLS estimates and confidence intervals. (Rows, top to bottom: Oceania, Europe, Americas, Asia, gdpPercap, pop.)



Dotplots of this kind can be very compact. The vertical axis 
can often be compressed quite a bit, with no loss in comprehension. 
In fact, they are often easier to read with much less room between 
the rows than when given a default square shape.

Get observation-level statistics with augment()

The values returned by augment() are all statistics calculated at
the level of the original observations. As such, they can be added
on to the data frame that the model is based on. A call to
augment() will return a data frame with all the original
observations used in the estimation of the model, together with
columns like the following:

• .fitted: the fitted values of the model
• .se.fit: the standard errors of the fitted values
• .resid: the residuals
• .hat: the diagonal of the hat matrix
• .sigma: an estimate of residual standard deviation when the
  corresponding observation is dropped from the model
• .cooksd: Cook’s distance, a common regression diagnostic
• .std.resid: the standardized residuals

Each of these variables is named with a leading dot, for example
.hat rather than hat. This is to guard against accidentally
confusing it with (or accidentally overwriting) an existing variable
in your data with this name. The columns of values returned will
differ slightly depending on the class of model being fitted.


out_aug <- augment(out)
head(out_aug) %>% round_df()


##   lifeExp gdpPercap      pop continent .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## 1    28.8       779  8425333      Asia    56.4    0.47  -27.6    0   8.34    0.01      -3.31
## 2    30.3       821  9240934      Asia    56.4    0.47  -26.1    0   8.34    0.00      -3.13
## 3    32.0       853 10267083      Asia    56.5    0.47  -24.5    0   8.35    0.00      -2.93
## 4    34.0       836 11537966      Asia    56.5    0.47  -22.4    0   8.35    0.00      -2.69
## 5    36.1       740 13079460      Asia    56.4    0.47  -20.3    0   8.35    0.00      -2.44
## 6    38.4       786 14880372      Asia    56.5    0.47  -18.0    0   8.36    0.00      -2.16


By default, augment() will extract the available data from the
model object. This will usually include the variables used in the
model itself but not any additional ones contained in the original
data frame. Sometimes it is useful to have these. We can add them
by specifying the data argument:

out_aug <- augment(out, data = gapminder)
head(out_aug) %>% round_df()


##       country continent year lifeExp      pop gdpPercap .fitted .se.fit .resid .hat .sigma .cooksd
## 1 Afghanistan      Asia 1952    28.8  8425333       779    56.4    0.47  -27.6    0   8.34    0.01
## 2 Afghanistan      Asia 1957    30.3  9240934       821    56.4    0.47  -26.1    0   8.34    0.00
## 3 Afghanistan      Asia 1962    32.0 10267083       853    56.5    0.47  -24.5    0   8.35    0.00
## 4 Afghanistan      Asia 1967    34.0 11537966       836    56.5    0.47  -22.4    0   8.35    0.00
## 5 Afghanistan      Asia 1972    36.1 13079460       740    56.4    0.47  -20.3    0   8.35    0.00
## 6 Afghanistan      Asia 1977    38.4 14880372       786    56.5    0.47  -18.0    0   8.36    0.00


## .std.resid 
## 1 -3.31 
## 2 -3.13 
## 3 -2.93 
## 4 -2.69 
## 5 -2.44 
## 6 -2.16 


Figure 6.7: Residuals vs fitted values. (x-axis: .fitted.)


If some rows containing missing data were dropped to fit the 
model, then these will not be carried over to the augmented data 
frame. 
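The point about dropped rows can be checked directly in base R. The toy data frame below is invented for the purpose; it has one missing outcome value.

```r
df <- data.frame(y = c(1.1, 1.9, NA, 4.2, 4.8),
                 x = c(1, 2, 3, 4, 5))

# With R's default na.action, the incomplete row is dropped before fitting
fit <- lm(y ~ x, data = df)

nrow(df)             # rows in the data
nobs(fit)            # rows actually used in the fit
length(fitted(fit))  # one fitted value per row used
```

Comparing nrow() on the data with nobs() on the model is a quick way to see how many cases were lost before you start plotting observation-level output.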

The new columns created by augment() can be used to create
some standard regression plots. For example, we can plot the
residuals versus the fitted values. Figure 6.7 suggests,
unsurprisingly, that our country-year data has rather more structure
than is captured by our OLS model.

p <- ggplot(data = out_aug, mapping = aes(x = .fitted, y = .resid))
p + geom_point() 


Get model-level statistics with glance()

The glance() function organizes the information typically presented
at the bottom of a model’s summary() output. By itself, it
usually just returns a table with a single row in it. But as we shall
see in a moment, the real power of broom’s approach is the way that
it can scale up to cases where we are grouping or subsampling our
data.


glance(out) %>% round_df()

##   r.squared adj.r.squared sigma statistic p.value df
## 1      0.58          0.58  8.37    393.91       0  7
##     logLik     AIC     BIC deviance df.residual
## 1 -6033.83 12083.6 12127.2   118754        1697
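To see the three levels side by side, here is a compact sketch that applies tidy(), augment(), and glance() to one small model. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not this chapter's model) and only inspects the shape of each result.

```r
library(broom)

fit <- lm(mpg ~ wt + hp, data = mtcars)

comp <- tidy(fit)     # component level: one row per term
obs <- augment(fit)   # observation level: one row per car
mod <- glance(fit)    # model level: a single row

dim(comp)
dim(obs)
dim(mod)
```

The number of rows in each table tells you which level you are working at: terms, observations, or the model as a whole.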

Broom is able to tidy (and augment, and glance at) a wide 
range of model types. Not all functions are available for all classes 
of model. Consult broom’s documentation for more details on 
what is available. For example, here is a plot created from the 
tidied output of an event-history analysis. First we generate a Cox 
proportional hazards model of some survival data. 

library(survival) 

out_cph <- coxph(Surv(time, status) ~ age + sex, data = lung) 
out_surv <- survfit(out_cph) 


The details of the fit are not important here, but in the first step
the Surv() function creates the response or outcome variable for
the proportional hazards model that is then fitted by the coxph()
function. Then the survfit() function creates the survival curve
from the model, much like we used predict() to generate
predicted values earlier. Try summary(out_cph) to see the model, and
summary(out_surv) to see the table of predicted values that will
form the basis for our plot. Next we tidy out_surv to get a data
frame, and plot it (fig. 6.8).

out_tidy <- tidy(out_surv) 

p <- ggplot(data = out_tidy, mapping = aes(time, estimate)) 
p + geom_line() + geom_ribbon(mapping = aes(ymin = conf.low, 
ymax = conf.high), alpha = 0.2) 



Figure 6.8: A Kaplan-Meier plot. 


6.6 Grouped Analysis and List Columns 

Broom makes it possible to quickly fit models to different subsets
of your data and get consistent and usable tables of results out
the other end. Let’s say we wanted to look at the gapminder data
by examining the relationship between life expectancy and GDP
by continent, for each year in the data. The gapminder data is at
bottom organized by country-years. That is the unit of observation
in the rows. If we wanted, we could take a slice of the data
manually, such as “all countries observed in Asia, in 1962” or “all
in Africa, 2002.” Here is “Europe, 1977”:

eu77 <- gapminder %>% filter(continent == "Europe", year == 1977)

We could then see what the relationship between life 
expectancy and GDP looked like for that continent-year group: 


fit <- lm(lifeExp ~ log(gdpPercap), data = eu77)
summary(fit)


##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = eu77)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -7.496 -1.031  0.093  1.176  3.712
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      29.489      7.161    4.12  0.00031 ***
## log(gdpPercap)    4.488      0.756    5.94  2.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.11 on 28 degrees of freedom
## Multiple R-squared: 0.557, Adjusted R-squared: 0.541
## F-statistic: 35.2 on 1 and 28 DF, p-value: 2.17e-06

With dplyr and broom we can do this for every continent-year
slice of the data in a compact and tidy way. We start with our
table of data and then (%>%) group the countries by continent and
year using the group_by() function. We introduced this grouping
operation in chapter 4. Our data is reorganized first by continent,
and within continent by year. Here we will take one further step
and nest the data that makes up each group:



out_le <- gapminder %>%
    group_by(continent, year) %>%
    nest()

out_le


## # A tibble: 60 x 3
##    continent  year data
##    <fct>     <int> <list>
##  1 Asia       1952 <tibble [33 x 4]>
##  2 Asia       1957 <tibble [33 x 4]>
##  3 Asia       1962 <tibble [33 x 4]>
##  4 Asia       1967 <tibble [33 x 4]>
##  5 Asia       1972 <tibble [33 x 4]>
##  6 Asia       1977 <tibble [33 x 4]>
##  7 Asia       1982 <tibble [33 x 4]>
##  8 Asia       1987 <tibble [33 x 4]>
##  9 Asia       1992 <tibble [33 x 4]>
## 10 Asia       1997 <tibble [33 x 4]>
## # ... with 50 more rows





Think of what nest() does as a more intensive version of what
group_by() does. The resulting object has the tabular form we
expect (it is a tibble), but it looks a little unusual. The first two
columns are the familiar continent and year. But we now also
have a new column, data, that contains a small table of data
corresponding to each continent-year group. This is a list column,
something we have not seen before. It turns out to be very useful
for bundling together complex objects (structured, in this case,
as a list of tibbles, each being a 33 x 4 table of data) within the rows
of our data (which remains tabular). The data for our “Europe 1977”
fit is in there. We can look at it, if we like, by filtering the data
and then unnesting the list column.

out_le %>% filter(continent == "Europe" & year == 1977) %>% unnest()

## # A tibble: 30 x 6
##    continent  year country         lifeExp    pop gdpPercap
##    <fct>     <int> <fct>             <dbl>  <int>     <dbl>
##  1 Europe     1977 Albania            68.9 2.51e6     3533.
##  2 Europe     1977 Austria            72.2 7.57e6    19749.
##  3 Europe     1977 Belgium            72.8 9.82e6    19118
##  4 Europe     1977 Bosnia and Her~    69.9 4.09e6     3528
##  5 Europe     1977 Bulgaria           70.8 8.80e6     7612
##  6 Europe     1977 Croatia            70.6 4.32e6    11305
##  7 Europe     1977 Czech Republic     70.7 1.02e7    14800
##  8 Europe     1977 Denmark            74.7 5.09e6    20423
##  9 Europe     1977 Finland            72.5 4.74e6    15605
## 10 Europe     1977 France             73.8 5.32e7    18293
## # ... with 20 more rows

List columns are useful because we can act on them in a compact
and tidy way. In particular, we can pass functions along to
each row of the list column and make something happen. For
example, a moment ago we ran a regression of life expectancy and
logged GDP for European countries in 1977. We can do that for
every continent-year combination in the data. We first create a
convenience function called fit_ols() that takes a single argument,
df (for data frame), and that fits the linear model we are
interested in. Then we map that function to each of our list column
rows in turn. Recall from chapter 4 that mutate creates new
variables or columns within a pipeline.

(The map action is an important idea in functional programming.
If you have written code in other, more imperative languages, you
can think of it as a compact alternative to writing for ... next
loops. You can of course write loops like this in R. Computationally
they are often not any less efficient than their functional
alternatives. But mapping functions to arrays is more easily
integrated into a sequence of data transformations.)
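The relationship between mapping a function and writing a loop can be sketched in base R. This example is an illustration under stated assumptions: it uses the built-in mtcars data rather than gapminder, the helper fit_one() is hypothetical, and base R's lapply() stands in for map(), since both apply a function to each element of a list and return a list.

```r
# Split the built-in mtcars data into one data frame per cylinder count
pieces <- split(mtcars, mtcars$cyl)

# A hypothetical helper that fits the same model to any piece
fit_one <- function(df) {
    lm(mpg ~ wt, data = df)
}

# Functional version: apply fit_one() to every element of the list
fits_mapped <- lapply(pieces, fit_one)

# Loop version: the same work, with the bookkeeping written out
fits_looped <- vector("list", length(pieces))
names(fits_looped) <- names(pieces)
for (nm in names(pieces)) {
    fits_looped[[nm]] <- fit_one(pieces[[nm]])
}

# Both approaches give the same slope for each group
sapply(fits_mapped, function(m) coef(m)[["wt"]])
sapply(fits_looped, function(m) coef(m)[["wt"]])
```

The loop needs a container set up in advance and an index to fill it. The mapped version is a single expression, which is what makes it easy to drop into a pipeline step such as mutate().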


fit_ols <- function(df) {
    lm(lifeExp ~ log(gdpPercap), data = df)
}

out_le <- gapminder %>%
    group_by(continent, year) %>%
    nest() %>%
    mutate(model = map(data, fit_ols))

out_le


## # A tibble: 60 x 4
##    continent  year data              model
##    <fct>     <int> <list>            <list>
##  1 Asia       1952 <tibble [33 x 4]> <S3: lm>
##  2 Asia       1957 <tibble [33 x 4]> <S3: lm>
##  3 Asia       1962 <tibble [33 x 4]> <S3: lm>
##  4 Asia       1967 <tibble [33 x 4]> <S3: lm>
##  5 Asia       1972 <tibble [33 x 4]> <S3: lm>
##  6 Asia       1977 <tibble [33 x 4]> <S3: lm>
##  7 Asia       1982 <tibble [33 x 4]> <S3: lm>
##  8 Asia       1987 <tibble [33 x 4]> <S3: lm>
##  9 Asia       1992 <tibble [33 x 4]> <S3: lm>
## 10 Asia       1997 <tibble [33 x 4]> <S3: lm>
## # ... with 50 more rows

Before starting the pipeline we create a new function: it is a
convenience function whose only job is to estimate a particular
OLS model on some data. Like almost everything in R, functions
are a kind of object. To make a new one, we use the slightly
special function() function. There is a little more detail on
creating functions in the appendix. To see what fit_ols() looks
like once it is created, type fit_ols without parentheses at the
console. To see what it does, try fit_ols(df = gapminder) or
summary(fit_ols(gapminder)).

Now we have two list columns: data and model. The latter was 
created by mapping the fit_ols() function to each row of data. 
Inside each element of model is a linear model for that continent- 
year. So we now have sixty OLS fits, one for every continent-year 
grouping. Having the models inside the list column is not much 
use to us in and of itself. But we can extract the information we 
want while keeping things in a tidy tabular form. For clarity we 
will run the pipeline from the beginning again, this time adding a 
few new steps. 

First we extract summary statistics from each model by mapping
the tidy() function from broom to the model list column.
Then we unnest the result, dropping the other columns in the
process. Finally, we filter out all the Intercept terms and also drop all
observations from Oceania. In the case of the Intercepts we do
this just out of convenience. Oceania we drop just because there
are so few observations. We put the results in an object called
out_tidy.

fit_ols <- function(df) {
    lm(lifeExp ~ log(gdpPercap), data = df)
}

out_tidy <- gapminder %>%
    group_by(continent, year) %>%
    nest() %>%
    mutate(model = map(data, fit_ols),
           tidied = map(model, tidy)) %>%
    unnest(tidied, .drop = TRUE) %>%
    filter(term %nin% "(Intercept)" &
           continent %nin% "Oceania")

out_tidy %>% sample_n(5)


## # A tibble: 5 x 7
##   continent  year term           estimate std.error statistic      p.value
##   <fct>     <int> <chr>             <dbl>     <dbl>     <dbl>        <dbl>
## 1 Europe     1987 log(gdpPercap)     4.14     0.752      5.51 0.00000693
## 2 Asia       1972 log(gdpPercap)     4.44     1.01       4.41 0.000116
## 3 Europe     1972 log(gdpPercap)     4.51     0.757      5.95 0.00000208
## 4 Americas   1952 log(gdpPercap)    10.4      2.72       3.84 0.000827
## 5 Asia       1987 log(gdpPercap)     5.17     0.727      7.12 0.0000000531


We now have tidy regression output with an estimate of the 
association between log GDP per capita and life expectancy for 
each year, within continents. We can plot these estimates (fig. 6.9) 
in a way that takes advantage of their groupiness. 


p <- ggplot(data = out_tidy,
            mapping = aes(x = year, y = estimate,
                          ymin = estimate - 2*std.error,
                          ymax = estimate + 2*std.error,
                          group = continent, color = continent))

p + geom_pointrange(position = position_dodge(width = 1)) +
    scale_x_continuous(breaks = unique(gapminder$year)) +
    theme(legend.position = "top") +
    labs(x = "Year", y = "Estimate", color = "Continent")

Figure 6.9: Yearly estimates of the association between GDP and life expectancy, pooled by continent. (Legend: continent, showing Africa, Americas, Asia, and Europe.)

The call to position_dodge() within geom_pointrange()
allows the point ranges for each continent to be near one another
within years, instead of being plotted right on top of one another.
We could have faceted the results by continent, but doing it this
way lets us see differences in the yearly estimates much more easily.
This technique is very useful not just for cases like this but
also when you want to compare the coefficients given by different
kinds of statistical model. This sometimes happens when we’re
interested in seeing how, say, OLS performs against some other
model specification.
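As a rough sketch of that second use, the code below tidies two different specifications, stacks the results with a column identifying the model, and dodges the point ranges by model. The two models and the mtcars data are assumptions chosen only to keep the example self-contained.

```r
library(broom)
library(dplyr)
library(ggplot2)

m_simple <- lm(mpg ~ wt, data = mtcars)
m_full <- lm(mpg ~ wt + hp, data = mtcars)

# Stack the tidied coefficients, labeling each row with its model
both <- bind_rows(simple = tidy(m_simple, conf.int = TRUE),
                  full = tidy(m_full, conf.int = TRUE),
                  .id = "model") %>%
    filter(term != "(Intercept)")

p <- ggplot(both, aes(x = term, y = estimate,
                      ymin = conf.low, ymax = conf.high,
                      color = model))

# Dodging keeps the two estimates for the same term side by side
p_out <- p + geom_pointrange(position = position_dodge(width = 0.5)) +
    coord_flip()
```

Printing p_out draws the plot. The term wt appears twice, once per model, so the shift in its estimate when hp is added is visible at a glance.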


6.7 Plot Marginal Effects 

Our earlier discussion of predict() was about obtaining estimates
of the average effect of some coefficient, net of the other terms in
the model. Over the past decade, estimating and plotting partial or
marginal effects from a model has become an increasingly common
way of presenting accurate and interpretively useful predictions.

Interest in marginal effects plots was stimulated by the realization
that the interpretation of terms in logistic regression models,
in particular, was trickier than it seemed—especially when there
were interaction terms in the model (Ai & Norton 2003). Thomas
Leeper’s margins package can make these plots for us.
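To make the idea concrete, here is a hand-rolled sketch of an average marginal effect for a binary predictor in a logistic regression. It is an illustration of the concept under stated assumptions, not the algorithm the margins package uses internally; the model and the built-in mtcars data are stand-ins.

```r
# Logistic regression on the built-in mtcars data (a stand-in example)
fit <- glm(vs ~ mpg + am, family = binomial, data = mtcars)

# The coefficient for am is a change in log-odds, hard to read directly
coef(fit)[["am"]]

# Predicted probabilities with am set to 1, then to 0, for every row,
# leaving each row's mpg at its observed value
p_am1 <- predict(fit, newdata = transform(mtcars, am = 1), type = "response")
p_am0 <- predict(fit, newdata = transform(mtcars, am = 0), type = "response")

# Average difference in predicted probability across the observed cases
ame_am <- mean(p_am1 - p_am0)
ame_am
```

The result is on the probability scale, which is usually what a reader wants to know: on average, how much does the predicted probability change when the predictor switches from 0 to 1?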

library(margins) 

To see it in action, we’ll take another look at the General
Social Survey data in gss_sm, this time focusing on the binary
variable obama. It is coded 1 if the respondent said he or she voted
for Barack Obama in the 2012 presidential election and 0 otherwise.
(As is common with retrospective questions on elections, rather
more people claim to have voted for Obama than is consistent with
the vote share he received in the election.) In this case, mostly for
convenience, the zero code includes all other answers to the
question, including those who said they voted for Mitt Romney,
those who said they did not vote, those who refused to answer, and
those who said they didn’t know who they voted for. We will fit a
logistic regression on obama, with polviews, race, and sex as the
predictors. (The age variable, the respondent’s age in years, is also
in the data but does not enter the model fitted below.) The sex
variable is coded as “Male” or “Female,” with “Male” as the
reference category. The race variable is coded as “White,” “Black,”
or “Other,” with “White” as the reference category. The polviews
measure is a self-reported scale of the respondent’s political
orientation from “Extremely Conservative” through “Extremely
Liberal,” with “Moderate” in the middle. We take polviews and
create a new variable, polviews_m, using the relevel() function to
recode “Moderate” to be the reference category. We fit the model
with the glm() function and specify an interaction between race
and sex.
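What relevel() changes is only the order of the factor levels, with the chosen reference moved to the front. A small sketch with an invented factor shows the effect.

```r
# An invented factor; by default its levels are sorted alphabetically
views <- factor(c("Liberal", "Moderate", "Conservative", "Moderate"))
levels(views)

# Move "Moderate" to the front so that it becomes the reference category
views_m <- relevel(views, ref = "Moderate")
levels(views_m)
```

Model-fitting functions treat the first level of a factor as the baseline, so after releveling every coefficient for this variable is a contrast with “Moderate.”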


gss_sm$polviews_m <- relevel(gss_sm$polviews, ref = "Moderate") 

out_bo <- glm(obama ~ polviews_m + sex*race, 

family = "binomial", data = gss_sm) 
summary(out_bo) 


##
## Call:
## glm(formula = obama ~ polviews_m + sex * race, family = "binomial",
##     data = gss_sm)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -2.905  -0.554   0.177   0.542   2.244
##
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       0.29649    0.13409    2.21   0.0270 *
## polviews_mExtremely Liberal       2.37295    0.52504    4.52  6.2e-06 ***
## polviews_mLiberal                 2.60003    0.35667    7.29  3.1e-13 ***
## polviews_mSlightly Liberal        1.29317    0.24843    5.21  1.9e-07 ***
## polviews_mSlightly Conservative  -1.35528    0.18129   -7.48  7.7e-14 ***
## polviews_mConservative           -2.34746    0.20038  -11.71  < 2e-16 ***
## polviews_mExtremely Conservative -2.72738    0.38721   -7.04  1.9e-12 ***
## sexFemale                         0.25487    0.14537    1.75   0.0796 .
## raceBlack                         3.84953    0.50132    7.68  1.6e-14 ***
## raceOther                        -0.00214    0.43576    0.00   0.9961
## sexFemale:raceBlack              -0.19751    0.66007   -0.30   0.7648
## sexFemale:raceOther               1.57483    0.58766    2.68   0.0074 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 2247.9  on 1697  degrees of freedom
## Residual deviance: 1345.9  on 1686  degrees of freedom
##   (1169 observations deleted due to missingness)
## AIC: 1370
##
## Number of Fisher Scoring iterations: 6

The summary reports the coefficients and other information.
We can now graph the data in any one of several ways. Using
margins(), we calculate the marginal effects for each variable:


bo_m <- margins(out_bo)
summary(bo_m) 


##                            factor     AME     SE        z      p   lower   upper
##            polviews_mConservative -0.4119 0.0283 -14.5394 0.0000 -0.4674 -0.3564
##  polviews_mExtremely Conservative -0.4538 0.0420 -10.7971 0.0000 -0.5361 -0.3714
##       polviews_mExtremely Liberal  0.2681 0.0295   9.0996 0.0000  0.2103  0.3258
##                 polviews_mLiberal  0.2768 0.0229  12.0736 0.0000  0.2319  0.3218
##   polviews_mSlightly Conservative -0.2658 0.0330  -8.0596 0.0000 -0.3304 -0.2011
##        polviews_mSlightly Liberal  0.1933 0.0303   6.3896 0.0000  0.1340  0.2526
##                         raceBlack  0.4032 0.0173  23.3568 0.0000  0.3694  0.4371
##                         raceOther  0.1247 0.0386   3.2297 0.0012  0.0490  0.2005
##                         sexFemale  0.0443 0.0177   2.5073 0.0122  0.0097  0.0789


The margins library comes with several plot methods of its
own. If you wish, at this point you can just try plot(bo_m) to
see a plot of the average marginal effects, produced with the general
look of a Stata graphic. Other plot methods in the margins
library include cplot(), which visualizes marginal effects conditional
on a second variable, and image(), which shows predictions
or marginal effects as a filled heatmap or contour plot.

Alternatively, we can take results from margins() and plot
them ourselves. To clean up the summary a little, we convert
it to a tibble, then use prefix_strip() and prefix_replace() to
tidy the labels. We want to strip the polviews_m and sex prefixes
and (to avoid ambiguity about “Other”) adjust the race
prefix.

Figure 6.10: Average marginal effects plot. (Rows, top to bottom: Race: Black, Liberal, Extremely liberal, Slightly liberal, Race: other, Female, Slightly conservative, Conservative, Extremely conservative.)



bo_gg <- as_tibble(summary(bo_m)) 

prefixes <- c("polviews_m", "sex") 

bo_gg$factor <- prefix_strip(bo_gg$factor, prefixes) 

bo_gg$factor <- prefix_replace(bo_gg$factor, "race", "Race: ") 

bo_gg %>% select(factor, AME, lower, upper)


## # A tibble: 9 x 4
##   factor                     AME   lower   upper
## * <chr>                    <dbl>   <dbl>   <dbl>
## 1 Conservative           -0.412  -0.467  -0.356
## 2 Extremely Conservative -0.454  -0.536  -0.371
## 3 Extremely Liberal       0.268   0.210   0.326
## 4 Liberal                 0.277   0.232   0.322
## 5 Slightly Conservative  -0.266  -0.330  -0.201
## 6 Slightly Liberal        0.193   0.134   0.253
## 7 Race: Black             0.403   0.369   0.437
## 8 Race: Other             0.125   0.0490  0.200
## 9 Female                  0.0443  0.00967 0.0789


Now we have a table that we can plot (fig. 6.10) as we have 
learned: 


p <- ggplot(data = bo_gg, aes(x = reorder(factor, AME),
                              y = AME, ymin = lower, ymax = upper))

p + geom_hline(yintercept = 0, color = "gray80") +
    geom_pointrange() + coord_flip() +
    labs(x = NULL, y = "Average Marginal Effect")


If we are just interested in getting conditional effects for a
particular variable, then conveniently we can ask the plot methods in
the margins library to do the work calculating effects for us but
without drawing their plot. Instead, they can return the results in
a format we can easily use in ggplot, and with less need for cleanup.
For example, with cplot() we can draw figure 6.11.


Figure 6.11: Conditional effects plot.


pv_cp <- cplot(out_bo, x = "sex", draw = FALSE)

p <- ggplot(data = pv_cp, aes(x = reorder(xvals, yvals),
                              y = yvals, ymin = lower, ymax = upper))

p + geom_hline(yintercept = 0, color = "gray80") +
    geom_pointrange() + coord_flip() +
    labs(x = NULL, y = "Conditional Effect")


The margins package is under active development. It can do 
much more than described here. The vignettes that come with 
the package provide more extensive discussion and numerous 
examples. 


6.8 Plots from Complex Surveys 

Social scientists often work with data collected using a complex
survey design. Survey instruments may be stratified by region
or some other characteristic, contain replicate weights to make
them comparable to a reference population, have a clustered
structure, and so on. In chapter 4 we learned how to calculate and
then plot frequency tables of categorical variables, using some data
from the General Social Survey. However, if we want accurate
estimates of U.S. households from the GSS, we will need to take
the survey’s design into account and use the survey weights
provided in the dataset. Thomas Lumley’s survey library provides a
comprehensive set of tools for addressing these issues. The tools
and the theory behind them are discussed in detail in Lumley
(2010), and an overview of the package is provided in Lumley
(2004). While the functions in the survey package are
straightforward to use and return results in a generally tidy form,
the package predates the tidyverse and its conventions by several
years. This means we cannot use survey functions directly with
dplyr. However, Greg Freedman Ellis has written a helper
package, srvyr, that solves this problem for us and lets us use the
survey library’s functions within a data analysis pipeline in a
familiar way.

For example, the gss_lon data contains a small subset of
measures from every wave of the GSS since its inception in
1972. It also contains several variables that describe the design
of the survey and provide replicate weights for observations
in various years. These technical details are described in the
GSS documentation. Similar information is typically provided by
other complex surveys. Here we will use this design
information to calculate weighted estimates of the distribution of
educational attainment by race for selected survey years from 1976
to 2016.

To begin, we load the survey and srvyr libraries. 

library(survey) 

library(srvyr) 

Next, we take our gss_lon dataset and use the survey tools to 
create a new object that contains the data, as before, but with some 
additional information about the survey’s design: 


options(survey.lonely.psu = "adjust") 
options(na.action="na.pass") 

gss_wt <- subset(gss_lon, year > 1974) %>%
    mutate(stratvar = interaction(year, vstrat)) %>%
    as_survey_design(ids = vpsu,
                     strata = stratvar,
                     weights = wtssall,
                     nest = TRUE)


The two options set at the beginning provide some information
to the survey library about how to behave. You should
consult Lumley (2010) and the survey package documentation for
details. The subsequent operations create gss_wt, an object with
one additional column (stratvar), describing the yearly sampling
strata. We use the interaction() function to do this. It
crosses the vstrat variable with the year variable to get a vector
of stratum information for each year. (We have to do this because
of the way the GSS codes its stratum information.) In the next
step, we use the as_survey_design() function to add the key pieces
of information about the survey design. It adds information about
the sampling identifiers (ids), the strata (strata), and the replicate
weights (weights). With those in place we can take advantage
of a large number of specialized functions in the survey library
that allow us to calculate properly weighted survey means or
estimate models with the correct sampling specification. For
example, we can calculate the distribution of education by race for
a series of years from 1976 to 2016. We use survey_mean() to
do this:


out_grp <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    group_by(year, race, degree) %>%
    summarize(prop = survey_mean(na.rm = TRUE))

out_grp 


## # A tibble: 150 x 5
##     year race  degree           prop   prop_se
##    <dbl> <fct> <fct>           <dbl>   <dbl>
##  1 1976. White Lt High School  0.328   0.0160
##  2 1976. White High School     0.518   0.0162
##  3 1976. White Junior College  0.0129  0.00298
##  4 1976. White Bachelor        0.101   0.00960
##  5 1976. White Graduate        0.0393  0.00644
##  6 1976. Black Lt High School  0.562   0.0611
##  7 1976. Black High School     0.337   0.0476
##  8 1976. Black Junior College  0.0426  0.0193
##  9 1976. Black Bachelor        0.0581  0.0239
## 10 1976. Black Graduate        0.      0.
## # ... with 140 more rows

The results returned in out_grp include standard errors. We
can also ask survey_mean() to calculate confidence intervals for
us, if we wish.

Grouping with group_by() lets us calculate counts or means
for the innermost variable, grouped by the next variable “up” or
“out,” in this case, degree by race, such that the proportions for
degree will sum to one for each group in race, and this will be
done separately for each value of year. If we want the marginal
frequencies, such that the values for all combinations of race and
degree sum to one within each year, we first have to interact
the variables we are cross-classifying. Then we group by the new
interacted variable and do the calculation as before:


out_mrg <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    mutate(racedeg = interaction(race, degree)) %>%
    group_by(year, racedeg) %>%
    summarize(prop = survey_mean(na.rm = TRUE))

out_mrg 


## # A tibble: 150 x 4
##     year racedeg                 prop    prop_se
##    <dbl> <fct>                  <dbl>    <dbl>
##  1 1976. White.Lt High School   0.298    0.0146
##  2 1976. Black.Lt High School   0.0471   0.00840
##  3 1976. Other.Lt High School   0.00195  0.00138
##  4 1976. White.High School      0.471    0.0160
##  5 1976. Black.High School      0.0283   0.00594
##  6 1976. Other.High School      0.00325  0.00166
##  7 1976. White.Junior College   0.0117   0.00268
##  8 1976. Black.Junior College   0.00357  0.00162
##  9 1976. Other.Junior College   0.       0.
## 10 1976. White.Bachelor         0.0919   0.00888
## # ... with 140 more rows


This gives us the numbers that we want and returns them in
a tidy data frame. The interaction() function produces variable
labels that are a compound of the two variables we interacted,
with each combination of categories separated by a period (such
as White.Graduate). However, perhaps we would like to see these
categories as two separate columns, one for race and one for
education, as before. Because the variable labels are organized in a
predictable way, we can use one of the convenient functions in the
tidyverse’s tidyr package to separate the single variable into two
columns while correctly preserving the row values. Appropriately,
this function is called separate().

out_mrg <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    mutate(racedeg = interaction(race, degree)) %>%
    group_by(year, racedeg) %>%
    summarize(prop = survey_mean(na.rm = TRUE)) %>%
    separate(racedeg, sep = "\\.", into = c("race", "degree"))

out_mrg 


## # A tibble: 150 x 5
##     year race  degree           prop    prop_se
##    <dbl> <chr> <chr>           <dbl>    <dbl>
##  1 1976. White Lt High School  0.298    0.0146
##  2 1976. Black Lt High School  0.0471   0.00840
##  3 1976. Other Lt High School  0.00195  0.00138
##  4 1976. White High School     0.471    0.0160
##  5 1976. Black High School     0.0283   0.00594
##  6 1976. Other High School     0.00325  0.00166
##  7 1976. White Junior College  0.0117   0.00268
##  8 1976. Black Junior College  0.00357  0.00162
##  9 1976. Other Junior College  0.       0.
## 10 1976. White Bachelor        0.0919   0.00888
## # ... with 140 more rows


The two backslashes before the period in the call to
separate() are necessary for R to interpret it literally as
a period. By default in search-and-replace operations
like this, the search terms are regular expressions.
The period acts as a special character, a kind of
wildcard, meaning "any character at all." To make
the regular expression engine treat it literally, we
add one backslash before it. The backslash is an
"escape" character. It means "The next character is
going to be treated differently from usual." However,
because the backslash is a special character as well,
we need to add a second backslash to make sure
the parser sees it properly.
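
The effect of the escaping can be checked in base R without any survey data. The sketch below uses strsplit(), which, like separate(), treats its split pattern as a regular expression by default. The labels vector is made up for illustration.

```r
# Hypothetical labels of the kind interaction() produces
labels <- c("White.Graduate", "Black.High School")

# Unescaped: "." matches every character, so only empty strings remain
unescaped <- strsplit(labels, ".")

# Escaped: "\\." matches only a literal period
escaped <- strsplit(labels, "\\.")

# fixed = TRUE switches off regular-expression matching altogether
fixed <- strsplit(labels, ".", fixed = TRUE)

escaped[[1]]
```

Either the escaped pattern or fixed = TRUE splits each label into its race and degree parts. The unescaped version does not, which is the failure the two backslashes guard against.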


The call to separate() says to take the racedeg column, split
each value when it sees a period, and reorganize the results into 
two columns, race and degree. This gives us a tidy table much 
like out_grp but for the marginal frequencies. 
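
The difference between the two kinds of proportion can be seen in a small base R sketch. It uses made-up, unweighted toy data and prop.table() rather than the survey tools, so it illustrates only the arithmetic of what sums to one, not the weighted estimation.

```r
# Toy, unweighted data (made up for illustration)
toy <- data.frame(
  race   = c("White", "White", "White", "Black", "Black", "Black"),
  degree = c("High School", "High School", "Bachelor",
             "High School", "Bachelor", "Bachelor"),
  stringsAsFactors = FALSE
)
tab <- table(toy$race, toy$degree)

# Conditional proportions: degree within race, so each row sums to one
within_race <- prop.table(tab, margin = 1)

# Marginal proportions: all race-by-degree cells together sum to one
overall <- prop.table(tab)

rowSums(within_race)
sum(overall)
```

The first table corresponds to the logic of out_grp (within a single year), the second to the logic of out_mrg.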

Reasonable people can disagree over how best to plot a small
multiple of a frequency table while faceting by year, especially
when there is some measure of uncertainty attached. A barplot is
the obvious approach for a single case, but when there are many
years it can be difficult to compare bars across panels. This is
especially the case when standard errors or confidence intervals are
used in conjunction with bars. This is sometimes called a “dynamite
plot,” not because it looks amazing but because the t-shaped
error bars on the tops of the columns make them look like cartoon
dynamite plungers. An alternative is to use a line graph to
join up the time observations, faceting on educational categories
instead of year. Figure 6.12 shows the results for our GSS data in
dynamite-plot form, where the error bars are defined as twice the
standard error in either direction around the point estimate.

Sometimes it may be preferable to show that the
underlying variable is categorical, as a bar chart
makes clear, and not continuous, as a line graph
suggests. Here the trade-off is in favor of the line
graphs as the bars are hard to compare across facets.

Figure 6.12: Weighted estimates of educational
attainment for whites and blacks, GSS selected years
1976-2016. Faceting barplots is often a bad idea,
and the more facets there are, the worse an idea it is.
With a small-multiple plot the viewer wants to
compare across panels (in this case, over time), but
this is difficult to do when the data inside the panels
are categorical comparisons shown as bars (in this
case, education level by group).



p <- ggplot(data = subset(out_grp, race %nin% "Other"),
mapping = aes(x = degree, y = prop, 

ymin = prop - 2*prop_se, 
ymax = prop + 2*prop_se, 
fill = race, 
color = race, 
group = race)) 

dodge <- position_dodge(width=0.9) 

p + geom_col(position = dodge, alpha = 0.2) + 

geom_errorbar(position = dodge, width = 0.2) + 
scale_x_discrete(labels = scales::wrap_format(10)) + 
scale_y_continuous(labels = scales::percent) + 
scale_color_brewer(type = "qual", palette = "Dark2") + 
scale_fill_brewer(type = "qual", palette = "Dark2") + 
labs(title = "Educational Attainment by Race", 
subtitle = "GSS 1976-2016", 
fill = "Race", 
color = "Race", 
x = NULL, y = "Percent") + 
facet_wrap(~ year, ncol = 2) + 
theme(legend.position = "top") 


This plot has a few cosmetic details and adjustments that we
will learn more about in chapter 8. As before, I encourage you to
peel back the plot from the bottom, one instruction at a time, to see
what changes. One useful adjustment to notice is the new call to
the scales library to adjust the labels on the x-axis. The
adjustment on the y-axis is familiar, scales::percent to convert the
proportion to a percentage. On the x-axis, the issue is that
several of the labels are rather long. If we do not adjust them they will
print over one another. The scales::wrap_format() function will
break long labels into lines. It takes a single numerical argument
(here 10) that is the maximum length a string can be before it is
wrapped onto a new line.
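
To see what wrapping does to the labels themselves, the effect can be approximated with base R's strwrap(). This is a rough stand-in rather than the scales implementation, and wrap_labels() is a hypothetical helper written only for this illustration.

```r
# A rough base-R stand-in for a label wrapper (hypothetical helper)
wrap_labels <- function(x, width = 10) {
  vapply(x,
         function(s) paste(strwrap(s, width = width), collapse = "\n"),
         character(1), USE.NAMES = FALSE)
}

degree_labels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

wrapped <- wrap_labels(degree_labels)

# Print each label, with a rule between them to show the line breaks
cat(wrapped, sep = "\n---\n")
```

Labels longer than the width are broken at spaces, while short labels such as Bachelor are left on one line.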

Figure 6.13: Faceting by education instead.


A graph like this is true to the categorical nature of the data,
while showing the breakdown of groups within each year. But you
should experiment with some alternatives. For example, we might
decide that it is better to facet by degree category instead, and put
the year on the x-axis within each panel. If we do that, then we
can use geom_line() to show a time trend, which is more natural,
and geom_ribbon() to show the error range. This is perhaps
a better way to show the data, especially as it brings out the time
trends within each degree category and allows us to see the
similarities and differences by racial classification at the same time
(fig. 6.13).

p <- ggplot(data = subset(out_grp, race %nin% "Other"),
            mapping = aes(x = year, y = prop, ymin = prop - 2*prop_se,
                          ymax = prop + 2*prop_se, fill = race, color = race,
                          group = race))

p + geom_ribbon(alpha = 0.3, aes(color = NULL)) +
    geom_line() +
    facet_wrap(~ degree, ncol = 1) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    scale_fill_brewer(type = "qual", palette = "Dark2") +
    labs(title = "Educational Attainment by Race",
         subtitle = "GSS 1976-2016", fill = "Race",
         color = "Race", x = NULL, y = "Percent") +
    theme(legend.position = "top")


6.9 Where to Go Next 

In general, when you estimate models and want to plot the results, 
the difficult step is not the plotting but rather calculating and 
extracting the right numbers. Generating predicted values and 
measures of confidence or uncertainty from models requires that 
you understand the model you are fitting and the function you use 
to fit it, especially when it involves interactions, cross-level effects, 
or transformations of the predictor or response scales. The details 
can vary substantially from model type to model type, and also 
with the goals of any particular analysis. It is unwise to approach
them mechanically. That said, several tools exist to help you
work with model objects and produce a default set of plots from
them.

Default plots for models

Just as model objects in R usually have a default summary()
method, printing out an overview tailored to the type of model
it is, they will usually have a default plot() method, too. Figures
produced by plot() are typically not generated via ggplot, but it
is usually worth exploring them. They typically make use of either
R’s base graphics or the lattice library (Sarkar 2008). These are
two plotting systems not covered in this book. Default plot
methods are easy to examine. Let’s take a look again at our simple OLS
model.


out <- lm(formula = lifeExp ~ log(gdpPercap) + pop + continent,
data = gapminder) 


To look at some of R’s default plots for this model, use the
plot() function.


# Plot not shown 

plot(out, which = c(1, 2), ask = FALSE) 


The which argument here selects the first two of the four default
plots for this kind of model. If you want to easily reproduce base
R’s default model graphics using ggplot, the ggfortify package is
worth examining. It is similar to broom in that it tidies the
output of model objects, but it focuses on producing a standard plot
(or group of plots) for a wide variety of model types. It does this
by defining a function called autoplot(). The idea is to be able
to use autoplot() with the output of many different kinds of
model.

A second option worth looking at is the coefplot package.
It provides a quick way to produce good-quality plots of point
estimates and confidence intervals (fig. 6.14). It has the
advantage of managing the estimation of interaction effects and other
occasionally tricky calculations.


Figure 6.14: A plot from coefplot.

library(coefplot)

out <- lm(formula = lifeExp ~ log(gdpPercap) + log(pop) + continent,
          data = gapminder)

coefplot(out, sort = "magnitude", intercept = FALSE) 


Tools in development 

Tidyverse tools for modeling and model exploration are being
actively developed. The broom and margins packages continue to
get more and more useful. There are also other projects worth
paying attention to. The infer package is in its early stages but can
already do useful things in a pipeline-friendly way. You can install
it from CRAN with install.packages("infer").

infer.netlify.com

Extensions to ggplot 

The GGally package provides a suite of functions designed to make
producing standard but somewhat complex plots a little easier. For
instance, it can produce generalized pairs plots, a useful way of
quickly examining possible relationships between several different
variables at once. This sort of plot is like the visual version of a
correlation matrix. It shows a bivariate plot for all pairs of variables
in the data. This is relatively straightforward when all the
variables are continuous measures. Things get more complex when,
as is often the case in the social sciences, some or all variables are
categorical or otherwise limited in the range of values they can
take. A generalized pairs plot can handle these cases. For example,
figure 6.15 shows a generalized pairs plot for five variables from
the organdata dataset.

library(GGally) 

organdata_sm <- organdata %>% select(donors, pop_dens, pubhealth,
                                     roads, consent_law)

ggpairs(data = organdata_sm, mapping = aes(color = consent_law),

upper = list(continuous = wrap("density"), combo = "box_no_facet"), 
lower = list(continuous = wrap("points"), combo = wrap("dot_no_facet"))) 




Figure 6.15: A generalized pairs plot made using the GGally library. 

Multipanel plots like those in figure 6.15 are intrinsically very
rich in information. When combined with several within-panel 
types of representation, or any more than a modest number of 
variables, they can become quite complex. They should be used 
sparingly for the presentation of finished work. More often they 
are a useful tool for a researcher to quickly investigate aspects of 
a data set. The goal is not to pithily summarize a single point one 
already knows but to open things up for further exploration. 



7 Draw Maps 


Choropleth maps show geographical regions colored, shaded, or
graded according to some variable. They are visually striking,
especially when the spatial units of the map are familiar
entities, like the countries of the European Union, or states in the
United States. But maps like this can also sometimes be
misleading. Although it is not a dedicated Geographical Information
System (GIS), R can work with geographical data, and ggplot can
make choropleth maps. But we’ll also consider some other ways of
representing data of this kind.

Figure 7.1 shows a series of maps of the U.S. presidential
election results in 2016. Reading from the top left, we see, first, a
state-level map where the margin of victory can be high (a darker
blue or red) or low (a lighter blue or red). The color scheme has
no midpoint. Second, we see a county-level map colored bright
red or blue depending on the winner. Third is a county-level map
where the color of red and blue counties is graded by the size of
the vote share. Again, the color scale has no midpoint. Fourth is
a county-level map with a continuous color gradient from blue to
red, but that passes through a purple midpoint for areas where the
balance of the vote is close to even. The map in the bottom left
has the same blue-purple-red scheme but distorts the geographical
boundaries by squeezing or inflating them to reflect the population
of the county shown. Finally in the bottom right we see a
cartogram, where states are drawn using square tiles, and the number
of tiles each state gets is proportional to the number of electoral
college votes it has (which in turn is proportional to that state’s
population).

Each of these maps shows data for the same event, but the
impressions they convey are very different. Each faces two main
problems. First, the underlying quantities of interest are only
partly spatial. The number of electoral college votes won and the
share of votes cast within a state or county are expressed in
spatial terms, but ultimately it is the number of people within those
regions that matters. Second, the regions themselves are of wildly
differing sizes, and they differ in a way that is not well correlated
with the magnitudes of the underlying votes. The mapmakers also
face choices that would arise in many other representations of the
data. Do we want to just show who won each state in absolute terms
(this is all that matters for the actual result, in the end) or do we
want to indicate how close the race was? Do we want to display
the results at some finer level of resolution than is relevant to the
outcome, such as county rather than state counts? How can we
convey that different data points can carry very different weights
because they represent vastly larger or smaller numbers of people?
It is tricky enough to convey these choices honestly with different
colors and shape sizes on a simple scatterplot. Often a map is like
a weird grid that you are forced to conform to even though you
know it systematically misrepresents what you want to show.

Figure 7.1: 2016 U.S. election results maps of different kinds. Maps by Ali Zifan (panels 1-4, 6) and Mark Newman (panel 5).

This is not always the case, of course. Sometimes our data 
really is purely spatial, and we can observe it at a fine enough level 
of detail that we can represent spatial distributions honestly and in 
a compelling way. But the spatial features of much social science 
are collected through entities such as precincts, neighborhoods, 
metro areas, census tracts, counties, states, and nations. These 
may themselves be socially contingent. A great deal of cartographic 
work with social-scientific variables involves working both with 
and against that arbitrariness. Geographers call this the Modifiable 
Areal Unit Problem, or MAUP (Openshaw 1983). 

7.1 Map U.S. State-Level Data 

Let’s take a look at some data for the U.S. presidential election in 
2016 and see how we might plot it in R. The election dataset has 
various measures of the vote and vote shares by state. Here we pick 
some columns and sample a few rows at random. 

election %>% select(state, total_vote,
                    r_points, pct_trump, party, census) %>%
    sample_n(5)

## # A tibble: 5 x 6
##   state          total_vote r_points pct_trump party      census
##   <chr>               <dbl>    <dbl>     <dbl> <chr>      <chr>
## 1 Kentucky         1924149.     29.8      62.5 Republican South
## 2 Vermont           315067.    -26.4      30.3 Democrat   Northeast
## 3 South Carolina   2103027.     14.3      54.9 Republican South
## 4 Wyoming           255849.     46.3      68.2 Republican West
## 5 Kansas           1194755.     20.4      56.2 Republican Midwest

Figure 7.2: 2016 election results. Would a two-color
choropleth map be more informative than this, or
less?

The FIPS code is a federal code that numbers states and
territories of the United States. It extends to the county level with an
additional three digits, so every U.S. county has a unique five-digit
identifier, where the first two digits represent the state. This dataset
also contains the census region of each state.

# Hex color codes for Dem Blue and Rep Red
party_colors <- c("#2E74C0", "#CB454A")

p0 <- ggplot(data = subset(election, st %nin% "DC"),
             mapping = aes(x = r_points,
                           y = reorder(state, r_points),
                           color = party))

p1 <- p0 + geom_vline(xintercept = 0, color = "gray30") +
    geom_point(size = 2)

p2 <- p1 + scale_color_manual(values = party_colors)

p3 <- p2 + scale_x_continuous(breaks = c(-30, -20, -10, 0, 10, 20, 30, 40),
                              labels = c("30\n(Clinton)", "20", "10", "0",
                                         "10", "20", "30", "40\n(Trump)"))

p3 + facet_wrap(~ census, ncol=1, scales="free_y") +
    guides(color=FALSE) + labs(x = "Point Margin", y = "") +
    theme(axis.text=element_text(size=8))


The first thing you should remember about spatial data is that
you don’t have to represent it spatially. We’ve been working with
country-level data throughout and have yet to make a map of it.
Of course, spatial representations can be very useful, and
sometimes absolutely necessary. But we can start with a state-level
dotplot, faceted by region (fig. 7.2). This figure brings together
many aspects of plot construction that we have worked on so
far, including subsetting data, reordering results by a second
variable, and using a scale formatter. It also introduces some
new options, like manually setting the color of an aesthetic. We
break up the construction process into several steps by creating
intermediate objects (p0, p1, p2) along the way. This makes the
code more readable. Bear in mind also that, as always, you can
try plotting each of these intermediate objects as well (just type
their name at the console and hit return) to see what they look
like. What happens if you remove the scales="free_y"
argument to facet_wrap()? What happens if you delete the call to
scale_color_manual()?

As always, the first task in drawing a map is to get a data frame 
with the right information in it, and in the right order. First we 
load R’s maps package, which provides us with some predrawn 
map data. 


library(maps) 

us_states <- map_data("state") 
head(us_states) 

##       long     lat group order  region subregion
## 1 -87.4620 30.3897     1     1 alabama      <NA>
## 2 -87.4849 30.3725     1     2 alabama      <NA>
## 3 -87.5250 30.3725     1     3 alabama      <NA>
## 4 -87.5308 30.3324     1     4 alabama      <NA>
## 5 -87.5709 30.3267     1     5 alabama      <NA>
## 6 -87.5881 30.3267     1     6 alabama      <NA>


dim(us_states) 

## [1] 15537 6 


This is just a data frame. It has more than 15,000 rows because
you need a lot of lines to draw a good-looking map. We can make a 
blank state map right away with this data, using geom_polygon(). 


p <- ggplot(data = us_states, mapping = aes(x = long, y = lat, 
group = group)) 

p + geom_polygon(fill = "white", color = "black") 


The map in figure 7.3 is plotted with latitude and longitude 
points, which are there as scale elements mapped to the x- and 
y-axes. A map is, after all, just a set of lines drawn in the right 
order on a grid. 



Figure 7.3: A first U.S. map.

Figure 7.4: Coloring the states.



We can map the fill aesthetic to region and change the color 
mapping to a light gray and thin the lines to make the state borders 
a little nicer (fig. 7.4). We’ll also tell R not to plot a legend. 

p <- ggplot(data = us_states, aes(x = long, y = lat, group = group,
fill = region)) 

p + geom_polygon(color = "gray90", size = 0.1) + guides(fill = FALSE) 

Next, let’s deal with the projection. By default the map is
plotted using the venerable Mercator projection. It doesn’t look that
good. Assuming we are not planning on sailing across the Atlantic,
the practical virtues of this projection are not much use to us,
either. If you glance again at the maps in figure 7.1, you’ll notice
they look nicer. This is because they are using an Albers projection.
(Look, for example, at the way that the U.S.-Canadian border is a
little curved along the 49th parallel from Washington state to
Minnesota, rather than a straight line.) Techniques for map projection
are a fascinating world of their own, but for now just remember we
can transform the default projection used by geom_polygon() via
the coord_map() function. Remember that we said that projection
onto a coordinate system is a necessary part of the plotting process
for any data. Normally it is left implicit. We have not usually had to
specify a coord_ function because most of the time we have drawn
our plots on a simple Cartesian plane. Maps are more complex.
Our locations and borders are defined on a more or less spherical
object, which means we must have a method for transforming or
projecting our points and lines from a round to a flat surface. The
many ways of doing this give us a menu of cartographic options.

The Albers projection requires two latitude parameters, lat0
and lat1. We give them their conventional values for a U.S. map
here (fig. 7.5). (Try messing around with their values and see what
happens when you redraw the map.)


p <- ggplot(data = us_states,
            mapping = aes(x = long, y = lat,
                          group = group, fill = region))

p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
    guides(fill = FALSE)

Now we need to get our own data onto the map. Remember,
underneath that map is just a big data frame specifying a
large number of lines that need to be drawn. We have to merge
our data with that data frame. Somewhat annoyingly, in the map
data the state names (in a variable named region) are in
lowercase. We can create a variable in our own data frame to
correspond to this, using the tolower() function to convert the state
names. We then use left_join() to merge, but you could also
use merge(..., sort = FALSE). This merge step is important!
You need to take care that the values of the key variables you
are matching on really do exactly correspond to one another. If
they do not, missing values (NA codes) will be introduced into
your merge, and the lines on your map will not join up. This
will result in a weirdly “sharded” appearance to your map when
R tries to fill the polygons. Here, the region variable is the only
column with the same name in both the data sets we are
joining, and so the left_join() function uses it by default. If the
keys have different names in each data set, you can specify that if
needed.

To reiterate, it is important to know your data and variables
well enough to check that they have merged properly. Do not
do it blindly. For example, if rows corresponding to Washington,
DC, were named “washington dc” in the region variable
of your election data frame but “district of columbia” in the
corresponding region variable of your map data, then merging
on region would mean no rows in the election data frame
would match “district of columbia” in the map data, and the
resulting merged variables for those rows would all be coded as
missing. Maps that look broken when you draw them are usually
caused by merge errors. But errors can also be subtle. For example,
perhaps one of your state names inadvertently has a leading
(or, worse, a trailing) space as a result of the data originally
being imported from elsewhere and not fully cleaned. That would
mean, for example, that california and california␣ are different
strings, and the match would fail. In ordinary use you
might not easily see the extra space (designated here by ␣). So be
careful.


election$region <- tolower(election$state)
us_states_elec <- left_join(us_states, election)
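One way to act on the advice about checking keys is to compare the two key columns before joining. The following is a minimal base R sketch, not the book's own code; the two vectors are hypothetical stand-ins for the region columns of the map data and the election data.

```r
# Hypothetical keys: one has a trailing space, one uses a different name.
map_keys  <- c("california", "district of columbia", "texas")
elec_keys <- c("california ", "washington dc", "texas")

# Keys present in the election data but absent from the map data.
unmatched_before <- setdiff(elec_keys, map_keys)

# trimws() strips leading and trailing whitespace. It repairs the stray
# space but cannot repair a unit that is named differently.
elec_clean <- trimws(elec_keys)
unmatched_after <- setdiff(elec_clean, map_keys)

unmatched_before
unmatched_after
```

Anything still listed after cleaning has to be recoded by hand before the merge will work as intended.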
We have now merged the data. Take a look at the object with
head(us_states_elec). Now that everything is in one big data 
frame, we can plot it on a map (fig. 7.6). 


Figure 7.6: Mapping the results. 


p <- ggplot(data = us_states_elec,
            aes(x = long, y = lat,
                group = group, fill = party))

p + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)



Figure 7.7: Election 2016 by state.

Figure 7.8: Two versions of percent Trump by state.


To complete the map (fig. 7.7), we will use our party colors 
for the fill, move the legend to the bottom, and add a title. Finally 
we will remove the grid lines and axis labels, which aren’t really 
needed, by defining a special theme for maps that removes most 
of the elements we don’t need. (We’ll learn more about themes 
in chapter 8. You can also see the code for the map theme in the 
appendix.) 


p0 <- ggplot(data = us_states_elec,
             mapping = aes(x = long, y = lat,
                           group = group, fill = party))
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)
p2 <- p1 + scale_fill_manual(values = party_colors) +
    labs(title = "Election Results 2016", fill = NULL)
p2 + theme_map()


With the map data frame in place, we can map other variables 
if we like. Let’s try a continuous measure, such as the percentage 
of the vote received by Donald Trump. To begin with, in figure 7.8 
we just map the variable we want (pct_trump) to the fill aesthetic 
and see what geom_polygon() does by default. 

p0 <- ggplot(data = us_states_elec,
             mapping = aes(x = long, y = lat, group = group, fill = pct_trump))

p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)

p1 + labs(title = "Trump vote") + theme_map() + labs(fill = "Percent")

p2 <- p1 + scale_fill_gradient(low = "white", high = "#CB454A") +
    labs(title = "Trump vote")
p2 + theme_map() + labs(fill = "Percent")

The default color used in the p1 object is blue. Just for reasons
of convention, that isn’t what is wanted here. In addition, the
gradient runs in the wrong direction. In our case, the standard
interpretation is that a higher vote share makes for a darker color. We
fix both of these problems in the p2 object by specifying the scale
directly. We’ll use the values we created earlier in party_colors.

For election results, we might prefer a gradient that diverges
from a midpoint. The scale_fill_gradient2() function gives us a
blue-red spectrum that passes through white by default. Alternatively,
we can respecify the midlevel color along with the high and
low colors. We will make purple our midpoint and use the muted()
function from the scales library to tone down the color a little.

p0 <- ggplot(data = us_states_elec,
             mapping = aes(x = long, y = lat, group = group, fill = d_points))

p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)

p2 <- p1 + scale_fill_gradient2() + labs(title = "Winning margins")
p2 + theme_map() + labs(fill = "Percent")

p3 <- p1 + scale_fill_gradient2(low = "red", mid = scales::muted("purple"),
                                high = "blue", breaks = c(-25, 0, 25, 50, 75)) +
    labs(title = "Winning margins")
p3 + theme_map() + labs(fill = "Percent")


If you look at the gradient scale for this first “purple America” 
map, in figure 7.9, you’ll see that it extends very high on the blue 
side. This is because Washington, DC, is included in the data and 
hence the scale. Even though it is barely visible on the map, DC has 
by far the highest points margin in favor of the Democrats of any 
unit of observation in the data. If we omit it, we’ll see that our scale 
shifts in a way that does not just affect the top of the blue end but 
recenters the whole gradient and makes the red side more vivid as 
a result. Figure 7.10 shows the result. 
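To see why one extreme unit can wash out the other side of a diverging gradient, it can help to work through the arithmetic. The sketch below is an illustration in base R, not ggplot2's own code: rescale_around_mid() is a hypothetical helper that places values on a 0 to 1 color axis with the midpoint fixed at 0.5, and the margins are made-up numbers chosen only to resemble the situation described.

```r
# Place values on [0, 1] so that `mid` lands at 0.5 and the value farthest
# from `mid` lands at one end of the scale.
rescale_around_mid <- function(x, mid = 0) {
  extent <- 2 * max(abs(range(x) - mid))
  (x - mid) / extent + 0.5
}

# Hypothetical winning margins in points (positive = Democratic lead).
margins_with_dc <- c(wy = -46, tx = -9, ca = 30, dc = 87)
margins_no_dc   <- margins_with_dc[names(margins_with_dc) != "dc"]

pos_with_dc <- rescale_around_mid(margins_with_dc)
pos_no_dc   <- rescale_around_mid(margins_no_dc)

round(pos_with_dc, 2)
round(pos_no_dc, 2)
```

With the outlier present, the most Republican unit sits only about a quarter of the way along the axis, so it never reaches the most saturated red. Once the outlier is removed, that same unit moves to the very end of the scale.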
Figure 7.9: Two views of Trump vs Clinton share: a
white midpoint and a purple America version.


p0 <- ggplot(data = subset(us_states_elec,
                           region %nin% "district of columbia"),
             aes(x = long, y = lat, group = group, fill = d_points))

p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)

p2 <- p1 + scale_fill_gradient2(low = "red",
                                mid = scales::muted("purple"),
                                high = "blue") +
    labs(title = "Winning margins")
p2 + theme_map() + labs(fill = "Percent")

This brings out the familiar choropleth problem of having
geographical areas that only partially represent the variable we are
mapping. In this case, we’re showing votes spatially, but what really
matters is the number of people who voted in each state.

7.2 America's Ur-choropleths 


Figure 7.10: A purple America version of Trump vs
Clinton that excludes results from Washington, DC.


In the U.S. case, administrative areas vary widely in geographical
area and also in population size. The modifiable areal unit
problem evident at the state level, as we have seen, arises even
more strongly at the county level. County-level U.S. maps can be
aesthetically pleasing because of the added detail they bring to a
national map. But they also make it easy to present a geographical
distribution to insinuate an explanation. The results can be tricky
to work with. When producing county maps, it is important to remember
that the states of New Hampshire, Rhode Island, Massachusetts,
and Connecticut are all smaller in area than any of the ten largest
Western counties. Many of those counties have fewer than a
hundred thousand people living in them. Some have fewer than ten
thousand inhabitants.

The result is that most choropleth maps of the United States
for whatever variable in effect show population density more than
anything else. The other big variable, in the U.S. case, is percent
black. Let’s see how to draw these two maps in R. The procedure
is essentially the same as it was for the state-level map. We
need two data frames, one containing the map data, and the other
containing the fill variables we want plotted. Because there are
more than three thousand counties in the United States, these two
data frames will be rather larger than they were for the state-level
maps.

The datasets are included in the socviz library. The county
map data frame has already been processed a little in order to
transform it to an Albers projection, and also to relocate (and
rescale) Alaska and Hawaii so that they fit into an area in the
bottom left of the figure. This is better than throwing away two states
from the data. The steps for this transformation and relocation
are not shown here. If you want to see how it’s done, consult the
appendix for details. Let’s take a look at our county map data first:


county_map %>% sample_n(5)

##            long      lat  order  hole piece             group    id
## 116977  -286097 -1302531 116977 FALSE     1  0500000US35025.1 35025
## 175994  1657614  -698592 175994 FALSE     1  0500000US51197.1 51197
## 186409   674547   -65321 186409 FALSE     1  0500000US55011.1 55011
## 22624    619876 -1093164  22624 FALSE     1  0500000US05105.1 05105
## 5906   -1983421 -2424955   5906 FALSE    10 0500000US02016.10 02016


It looks the same as our state map data frame, but it is much 
larger, running to almost 200,000 rows. The id field is the FIPS 
code for the county. Next, we have a data frame with county-level 
demographic, geographic, and election data: 


county_data %>%
    select(id, name, state, pop_dens, pct_black) %>%
    sample_n(5)

##         id                name state    pop_dens   pct_black
## 3029 53051 Pend Oreille County    WA [  0,  10) [ 0.0, 2.0)
## 1851 35041    Roosevelt County    NM [  0,  10) [ 2.0, 5.0)
## 1593 29165       Platte County    MO [100, 500) [ 5.0,10.0)
## 2363 45009      Bamberg County    SC [ 10,  50) [50.0,85.3]
## 654  17087      Johnson County    IL [ 10,  50) [ 5.0,10.0)


This data frame includes information for entities besides
counties, though not for all variables. If you look at the top of the
object with head(), you’ll notice that the first row has an id of 0.
Zero is the FIPS code for the entire United States, and thus the data
in this row are for the whole country. Similarly, the second row has
an id of 01000, which corresponds to the state FIPS of 01, for the
whole of Alabama. As we merge county_data into county_map,
these state rows will be dropped, along with the national row, as
county_map only has county-level data.

Figure 7.11: U.S. population density by county.

We merge the data frames using the shared FIPS id column: 

county_full <- left_join(county_map, county_data, by = "id")
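The dropping of the national and state rows can be checked on a toy example. This is a minimal sketch in base R using merge() with all.x = TRUE as a stand-in for a left join; the two small data frames are hypothetical miniatures of county_map and county_data, with the FIPS ids stored as character strings so that leading zeros are kept.

```r
# Hypothetical map data: two counties, one drawn with two points.
county_map_toy <- data.frame(
  id   = c("01001", "01001", "01003"),
  long = c(1.0, 1.1, 2.0),
  lat  = c(5.0, 5.1, 6.0),
  stringsAsFactors = FALSE
)

# Hypothetical attribute data: includes a national row and a state row.
county_data_toy <- data.frame(
  id   = c("0", "01000", "01001", "01003"),
  name = c("United States", "Alabama", "Autauga County", "Baldwin County"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the map data, as a left join does.
# Rows of county_data_toy whose id never appears in the map are dropped.
county_full_toy <- merge(county_map_toy, county_data_toy,
                         by = "id", all.x = TRUE, sort = FALSE)

nrow(county_full_toy)
unique(county_full_toy$name)
```

The merged object has one row per map point, and only county names survive.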

With the data merged, we can map the population density per 
square mile (fig. 7.11). 

p <- ggplot(data = county_full,
            mapping = aes(x = long, y = lat,
                          fill = pop_dens,
                          group = group))

p1 <- p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()

p2 <- p1 + scale_fill_brewer(palette = "Blues",
                             labels = c("0-10", "10-50", "50-100", "100-500",
                                        "500-1,000", "1,000-5,000", ">5,000"))

p2 + labs(fill = "Population per\nsquare mile") +
    theme_map() +
    guides(fill = guide_legend(nrow = 1)) +
    theme(legend.position = "bottom")
If you try out the p1 object, you will see that ggplot produces
a legible map, but by default it chooses an unordered categorical
layout. This is because the pop_dens variable is not ordered.
We could recode it so that R is aware of the ordering. Alternatively,
we can manually supply the right sort of scale using the
scale_fill_brewer() function, together with a nicer set of labels.
We will learn more about this scale function in the next chapter.
We also tweak how the legend is drawn using the guides() function
to make sure each element of the key appears on the same
row. Again, we will see this use of guides() in more detail in the
next chapter. The use of coord_equal() makes sure that the relative
scale of our map does not change even if we alter the overall
dimensions of the plot.
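The recoding route mentioned above, making R aware of the ordering, could look something like the following. It is a minimal base R sketch on a hypothetical vector of density bins, not a transformation of the actual pop_dens variable.

```r
# Hypothetical density bins arriving as plain character strings.
dens <- c("10-50", "0-10", "100-500", "50-100", "0-10")

# Without explicit levels, factor() orders the labels by sorting the text,
# which need not match their substantive order.
unordered <- factor(dens)
levels(unordered)

# Supplying the levels from low to high, with ordered = TRUE, tells R
# that the categories form a sequence.
dens_levels <- c("0-10", "10-50", "50-100", "100-500")
ordered_dens <- factor(dens, levels = dens_levels, ordered = TRUE)

levels(ordered_dens)
ordered_dens[1] < ordered_dens[3]   # comparisons now make sense
```

Once a variable is an ordered factor, comparisons between categories are meaningful, which is what a sequential palette needs.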

We can now do exactly the same thing for our map of percent 
black population by county (fig. 7.12). Once again, we specify a 
palette for the fill mapping using scale_fill_brewer(), this time 
choosing a different range of hues for the map. 

p <- ggplot(data = county_full,
            mapping = aes(x = long, y = lat, fill = pct_black,
                          group = group))

p1 <- p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()
p2 <- p1 + scale_fill_brewer(palette = "Greens")

p2 + labs(fill = "US Population, Percent Black") +
    guides(fill = guide_legend(nrow = 1)) +
    theme_map() + theme(legend.position = "bottom")

Figure 7.12: Percent black population by county.

Figure 7.13: Gun-related suicides by county; reverse-coded population
density by county. Before tweeting this picture, please read the text for
discussion of what is wrong with it.

Figures 7.11 and 7.12 are America’s “ur-choropleths.” Between
the two of them, population density and percent black will do a
lot to obliterate many a suggestively patterned map of the United
States. These two variables aren’t explanations of anything in
isolation, but if it turns out that it is more useful to know one or both
of them instead of the thing you’re plotting, you probably want to
reconsider your theory.

As an example of the problem in action, let’s draw two 
new county-level choropleths (fig. 7.13). The first is an effort to 
replicate a poorly sourced but widely circulated county map of 
firearm-related suicide rates in the United States. The su_gun6 
variable in county_data (and county_full) is a measure of 
the rate of all firearm-related suicides between 1999 and 2015. 
The rates are binned into six categories. We have a pop_dens6 
variable that divides the population density into six categories, 
too. 

We first draw a map with the su_gun6 variable. We will match 
the color palettes between the maps, but for the population map 
we will flip our color scale around so that less populated areas are 
shown in a darker shade. We do this by using a function from the
RColorBrewer library to manually create two palettes. The rev() 
function used here reverses the order of a vector. 


orange_pal <- RColorBrewer::brewer.pal(n = 6, name = "Oranges")
orange_pal

## [1] "#FEEDDE" "#FDD0A2" "#FDAE6B" "#FD8D3C" "#E6550D"
## [6] "#A63603"

orange_rev <- rev(orange_pal)
orange_rev

## [1] "#A63603" "#E6550D" "#FD8D3C" "#FDAE6B" "#FDD0A2"
## [6] "#FEEDDE"


The brewer.pal() function produces evenly spaced color
schemes to order from any one of several named palettes. The 
colors are specified in hexadecimal format. Again, we will learn 
more about color specifications and how to manipulate palettes for 
mapped variables in chapter 8. 

gun_p <- ggplot(data = county_full,

mapping = aes(x = long, y = lat, 
fill = su_gun6, 
group = group)) 

gun_p1 <- gun_p + geom_polygon(color = "gray90", size = 0.05) + coord_equal() 

gun_p2 <- gun_p1 + scale_fill_manual(values = orange_pal) 

gun_p2 + labs(title = "Gun-Related Suicides, 1999-2015", 
fill = "Rate per 100,000 pop.") + 
theme_map() + theme(legend.position = "bottom") 


Having drawn the gun plot, we use almost exactly the same 
code to draw the reverse-coded population density map. 

pop_p <- ggplot(data = county_full, mapping = aes(x = long, y = lat,
                                                  fill = pop_dens6,
                                                  group = group))

pop_p1 <- pop_p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()

pop_p2 <- pop_p1 + scale_fill_manual(values = orange_rev) 

pop_p2 + labs(title = "Reverse-coded Population Density", 
fill = "People per square mile") + 
theme_map() + theme(legend.position = "bottom") 


It’s clear that the two maps are not identical. However, the 
visual impact of the first has a lot in common with the second. 
The dark bands in the West (except for California) stand out, and 
they fade as we move toward the center of the country. There are 
some strong similarities elsewhere on the map too, such as in the 
Northeast. 

The gun-related suicide measure is already expressed as a rate. 
It is the number of qualifying deaths in a county, divided by that 
county’s population. Normally, we standardize in this way to
“control for” the fact that larger populations will tend to produce more
gun-related suicides just because they have more people in them. 
However, this sort of standardization has its limits. In particular, 
when the event of interest is not very common, and there is very 
wide variation in the base size of the units, then the denominator 
(e.g., the population size) starts to be expressed more and more in 
the standardized measure. 
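A small worked example may make the point about denominators concrete. The sketch below uses hypothetical populations and counts; rate_per_100k() is an illustrative helper, not a function from any package used in the chapter.

```r
# Rate per 100,000 population.
rate_per_100k <- function(events, pop) events / pop * 1e5

# Hypothetical counties: a very small one and a very large one.
small_pop <- 2000
large_pop <- 2000000

# The same change of a single event in each county.
small_rates <- rate_per_100k(c(0, 1), small_pop)
large_rates <- rate_per_100k(c(200, 201), large_pop)

small_rates   # 0 and 50 per 100,000
large_rates   # 10 and 10.05 per 100,000
```

A single additional event moves the small county's rate by fifty points and the large county's rate by a twentieth of a point. When events are rare, the rate in sparsely populated units mostly reflects how small the denominator is.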

In addition, and more subtly, the data is subject to reporting
constraints connected to population size. If there are fewer than ten
events per year for a cause of death, the Centers for Disease Control
(CDC) will not report them at the county level because it might be
possible to identify particular deceased individuals. Assigning data
like this to bins creates a threshold problem for choropleth maps.
Look again at figure 7.13. The gun-related suicides panel seems to
show a north-south band of counties with the lowest rate of
suicides running from the Dakotas down through Nebraska, Kansas,
and into West Texas. Oddly, this band borders counties in the West
with the very highest rates, from New Mexico on up. But from the
density map we can see that many counties in both these regions
have very low population densities. Are they really that different
in their gun-related suicide rates?

Probably not. More likely, we are seeing an artifact arising
from how the data is coded. For example, imagine a county with
100,000 inhabitants that experiences nine gun-related suicides in
a year. The CDC will not report this number. Instead it will be
coded as “suppressed,” accompanied by a note saying any
standardized estimates or rates will also be unreliable. But if we are
determined to make a map where all the counties are colored in,
we might be tempted to put any suppressed results into the
lowest bin. After all, we know that the number is somewhere between
zero and ten. Why not just code it as zero? Meanwhile, a county
with 100,000 inhabitants that experiences twelve gun-related
suicides a year will be numerically reported. The CDC is a responsible
organization, and so although it provides the absolute number of
deaths for all counties above the threshold, the notes to the data file
will still warn you that any rate calculated with this number will be
unreliable. If we do it anyway, then twelve deaths in a small
population might well put a sparsely populated county in the highest
category of suicide rate. Meanwhile, a low-population county just
under that threshold would be coded as being in the lowest
(lightest) bin. But in reality they might not be so different, and in any
case efforts to quantify that difference will be unreliable. If
estimates for these counties cannot be obtained directly or estimated
with a good model, then it is better to drop those cases as
missing, even at the cost of your beautiful map, than have large areas
of the country painted with a color derived from an unreliable
number.
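The threshold artifact can be reproduced with two made-up counties. This is a sketch under stated assumptions: the populations and counts are hypothetical, and the suppression rule is simplified to "fewer than ten events are not reported."

```r
# Two hypothetical small counties with the same population.
pop       <- c(a = 20000, b = 20000)
deaths    <- c(a = 9, b = 12)     # true annual counts
threshold <- 10

# Counts below the threshold are suppressed: unknown, not zero.
reported <- ifelse(deaths < threshold, NA, deaths)

# Tempting but misleading: recode suppressed counts as zero.
as_zero <- ifelse(is.na(reported), 0, reported)

true_rate <- deaths   / pop * 1e5   # 45 and 60 per 100,000
rate_zero <- as_zero  / pop * 1e5   #  0 and 60 per 100,000
rate_na   <- reported / pop * 1e5   # NA and 60 per 100,000
```

The two counties differ by three deaths, yet coding the suppressed value as zero places them at opposite ends of the scale. Keeping the suppressed county as missing loses a polygon's fill but does not invent a contrast.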

Small differences in reporting, combined with coarse binning 
and miscoding, will produce spatially misleading and substantively
mistaken results. It might seem that focusing on the details
of variable coding in this particular case is a little too much 
in the weeds for a general introduction. But it is exactly these 
details that can dramatically alter the appearance of any graph, 
and especially maps, in a way that can be hard to detect after 
the fact. 


Do not do this. One standard alternative is to 
estimate the suppressed observations using a count 
model. An approach like this might naturally lead to 
more extensive, properly spatial modeling of the 
data. 


7.3 Statebins 

As an alternative to state-level choropleths, we can consider
statebins, using a package developed by Bob Rudis. We will use it
to look again at our state-level election results. Statebins is similar
to ggplot but has a slightly different syntax from the one
we’re used to. It needs several arguments, including the basic
data frame (the state_data argument), a vector of state names
(state_col), and the value being shown (value_col). In addition,
we can optionally tell it the color palette we want to use and the
color of the text to label the state boxes. For a continuous variable
we can use statebins_continuous(), as follows, to make
figure 7.14:
library(statebins)

statebins_continuous(state_data = election, state_col = "state",
                     text_color = "white", value_col = "pct_trump",
                     brewer_pal = "Reds", font_size = 3,
                     legend_title = "Percent Trump")

statebins_continuous(state_data = subset(election, st %nin% "DC"),
                     state_col = "state",
                     text_color = "black", value_col = "pct_clinton",
                     brewer_pal = "Blues", font_size = 3,
                     legend_title = "Percent Clinton")


Figure 7.14: Statebins of the election results. We 
omit DC from the Clinton map to prevent the scale 
becoming unbalanced. 







Sometimes we will want to present categorical data. If our variable
is already cut into categories we can use statebins_manual()
to represent it. Here we add a new variable to the election data
called color, just mirroring party names with two appropriate
color names. We do this because we need to specify the colors we
are using by way of a variable in the data frame, not as a proper
mapping. We tell the statebins_manual() function that the colors
are contained in a column named color and use it for the first
map in figure 7.15.

Alternatively, we can have statebins() cut the data using the
breaks argument, as in the second plot in figure 7.15.
election <- election %>% mutate(color = recode(party, Republican = "darkred",
                                               Democrat = "royalblue"))

statebins_manual(state_data = election, state_col = "st",
                 color_col = "color", text_color = "white",
                 font_size = 3, legend_title = "Winner",
                 labels = c("Trump", "Clinton"), legend_position = "right")

statebins(state_data = election,
          state_col = "state", value_col = "pct_trump",
          text_color = "white", breaks = 4,
          labels = c("4-21", "21-37", "37-53", "53-70"),
          brewer_pal = "Reds", font_size = 3, legend_title = "Percent Trump")

Figure 7.15: Manually specifying colors for statebins.
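To get a feel for what cutting a continuous variable into four groups does, here is a base R sketch with cut(). It assumes equal-width binning of the kind cut() performs when given a number of breaks; whether statebins bins in exactly this way internally is an assumption, not something stated in the text. The vote shares are hypothetical.

```r
# Hypothetical vote shares spanning roughly the range of the legend labels.
pct <- c(4.1, 29.8, 45.5, 52.0, 60.7, 70.1)

# breaks = 4 asks cut() for four intervals of equal width across the range.
bins <- cut(pct, breaks = 4)
table(bins)

# The same cut with one readable label per interval.
bins_labeled <- cut(pct, breaks = 4,
                    labels = c("4-21", "21-37", "37-53", "53-70"))
as.character(bins_labeled)
```

Note that equal-width bins need not contain equal numbers of observations. In this toy vector the lowest two bins hold one value each and the upper two hold two each.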


7.4 Small-Multiple Maps 

Sometimes we have geographical data with repeated observations 
over time. A common case is to have a country- or state-level
measure observed over a period of years. In these cases, we might
want to make a small-multiple map to show changes over time. 
For example, the opiates data has state-level measures of the 
death rate from opiate-related causes (such as heroin or fentanyl 
overdoses) between 1999 and 2014. 


opiates

## # A tibble: 800 x 11
##     year state        fips deaths population crude adjusted
##    <int> <chr>       <int>  <int>      <int> <dbl>    <dbl>
##  1  1999 Alabama         1     37    4430141 0.800    0.800
##  2  1999 Alaska          2     27     624779 4.30     4.00
##  3  1999 Arizona         4    229    5023823 4.60     4.70
##  4  1999 Arkansas        5     28    2651860 1.10     1.10
##  5  1999 California      6   1474   33499204 4.40     4.50
##  6  1999 Colorado        8    164    4226018 3.90     3.70
##  7  1999 Connecticut     9    151    3386401 4.50     4.40
##  8  1999 Delaware       10     32     774990 4.10     4.10
##  9  1999 District o~    11     28     570213 4.90     4.90
## 10  1999 Florida        12    402   15759421 2.60     2.60
## # ... with 790 more rows, and 4 more variables:
## #   adjusted_se <dbl>, region <ord>, abbr <chr>,
## #   division_name <chr>

As before, we can take our us_states object, the one with 
the state-level map details, and merge it with our opiates dataset. 
As before, we convert the state variable in the opiates data to 
lowercase first, to make the match work properly. 
opiates$region <- tolower(opiates$state)
opiates_map <- left_join(us_states, opiates)

Because the opiates data includes the year variable, we are
now in a position to make a faceted small-multiple with one map
for each year in the data. The following chunk of code is similar
to the single state-level maps we have drawn so far. We specify the
map data as usual, adding geom_polygon() and coord_map() to
it, with the arguments those functions need. Instead of cutting our
data into bins, we will plot the continuous values for the adjusted
death rate variable (adjusted) directly. To help plot this variable
effectively, we will use a new scale function from the viridis
library. The viridis colors run in low-to-high sequences and do
a very good job of combining perceptually uniform colors with
easy-to-see, easily contrasted hues along their scales. The viridis
library provides continuous and discrete versions, both in several
alternatives. Some balanced palettes can be a little washed
out at their lower end, especially, but the viridis palettes avoid
this. In this code, the _c suffix in the scale_fill_viridis_c()
function signals that it is the scale for continuous data. There is
a scale_fill_viridis_d() equivalent for discrete data.

If you want to experiment with cutting the data into
groups, take a look at the cut_interval() function.

We facet the maps just like any other small-multiple with
facet_wrap(). We use the theme() function to put the legend
at the bottom and remove the default shaded background from
the year labels. We will learn more about this use of the theme()
function in chapter 8. The final map is shown in figure 7.16.


library(viridis)

p0 <- ggplot(data = subset(opiates_map, year > 1999),
             mapping = aes(x = long, y = lat,
                           group = group,
                           fill = adjusted))

p1 <- p0 + geom_polygon(color = "gray90", size = 0.05) +
    coord_map(projection = "albers", lat0 = 39, lat1 = 45)

p2 <- p1 + scale_fill_viridis_c(option = "plasma")

p2 + theme_map() + facet_wrap(~ year, ncol = 3) +
    theme(legend.position = "bottom",
          strip.background = element_blank()) +
    labs(fill = "Death rate per 100,000 population ",
         title = "Opiate Related Deaths by State, 2000-2014")


Try revisiting your code for the ur-choropleths, but
use continuous rather than binned measures, as well
as the viridis palette. Instead of pct_black, use the
black variable. For the population density, divide
pop by land_area. You will need to adjust the
scale_ functions. How do the maps compare to the
binned versions? What happens to the population
density map, and why?


Is this a good way to visualize this data? As we discussed above, 
choropleth maps of the United States tend to track first the size of 
the local population and secondarily the percent of the population
that is African American. The differences in the geographical
size of states makes spotting changes more difficult again. And it 
is quite difficult to compare repeatedly across spatial regions. The 
repeated measures do mean that some comparison is possible, and 
the strong trends for this data make things a little easier to see. In 
this case, a casual viewer might think, for example, that the opioid 
crisis was worst in the desert Southwest in comparison to many 
other parts of the country, although it also seems that something 
serious is happening in the Appalachians. 


7.5 Is Your Data Really Spatial? 



Figure 7.17: All the states at once. 


As we noted at the beginning of the chapter, even if our data is
collected via or grouped into spatial units, it is always worth
asking whether a map is the best way to present it. Much county,
state, and national data is not properly spatial, insofar as it is really
about individuals (or some other unit of interest) rather than the
geographical distribution of those units per se. Let’s take our
state-level opiates data and redraw it as a time-series plot. We will keep
the state-level focus (these are state-level rates, after all) but try to
make the trends more directly visible.

We could just plot the trends for every state, as we did at the 
very beginning with the gapminder data. But fifty states (as in 
fig. 7.17) is too many lines to keep track of at once. 


p <- ggplot(data = opiates, mapping = aes(x = year, y = adjusted,
group = state)) 

p + geom_line(color = "gray70") 


A more informative approach is to take advantage of the
geographical structure of the data by using the census regions to group
the states. Imagine a faceted plot showing state-level trends within
each region of the country, perhaps with a trend line for each
region. To do this, we will take advantage of ggplot’s ability to layer
geoms one on top of another, using a different dataset in each case.
We begin by taking the opiates data (removing Washington, DC,
as it is not a state) and plotting the adjusted death rate over time.


p0 <- ggplot(data = drop_na(opiates, division_name),
             mapping = aes(x = year, y = adjusted))

p1 <- p0 + geom_line(color = "gray70",
                     mapping = aes(group = state))


The drop_na() function deletes rows that have observations 
missing on the specified variables, in this case just division_name,
because Washington, DC, is not part of any census division. We 
map the group aesthetic to state in geom_line(), which gives us 
a line plot for every state. We use the color argument to set the 
lines to a light gray. Next we add a smoother: 


p2 <- p1 + geom_smooth(mapping = aes(group = division_name),
se = FALSE) 


For this geom we set the group aesthetic to division_name. 
(Division is a smaller census classification than region.) If we set it 
to state, we would get fifty separate smoothers in addition to our 
fifty trend lines. Then, using what we learned in chapter 4, we add 
a geom_text_repel() object that puts the label for each state at the
end of the series. Because we are labeling lines rather than points, 
we want the state label to appear only at the end of the line. The 
trick is to subset the data so that only the points from the last year 
observed are used (and thus labeled). We also must remember to 
remove Washington, DC, again here, as the new data argument 
supersedes the original one in p0. 


p3 <- p2 + geom_text_repel(data = subset(opiates,
                                         year == max(year) & abbr != "DC"),
                           mapping = aes(x = year, y = adjusted, label = abbr),
                           size = 1.8, segment.color = NA, nudge_x = 30) +
    coord_cartesian(c(min(opiates$year),
                      max(opiates$year)))








By default, geom_text_repel() will add little line segments that
indicate what the labels refer to. But that is not helpful here, as we
are already dealing with the end point of a line. So we turn them off
with the argument segment.color = NA. We also bump the labels
off to the right of the lines a little, using the nudge_x argument, 
and use coord_cartesian() to set the axis limits so that there is 
enough room for them. 
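To see what the subsetting trick does on its own, here is a minimal sketch that uses a small made-up data frame in place of opiates. The toy object and its values are assumptions for illustration only; just the column names (abbr, year, adjusted) mirror the ones used in the text.

```r
# A made-up stand-in for the opiates data: three "states", three years each.
toy <- data.frame(
  abbr     = rep(c("AK", "DC", "WV"), each = 3),
  year     = rep(2012:2014, times = 3),
  adjusted = c(8, 9, 10, 5, 6, 7, 20, 25, 30)
)

# Keep only the rows from the last observed year, and drop DC.
# These are the only rows a label layer built on this subset would see.
last_year <- subset(toy, year == max(year) & abbr != "DC")
last_year
```

Because the label layer is handed only these rows, each series gets exactly one label, placed at its final point. The lines themselves are unaffected, as they are drawn from the data supplied to the earlier layers.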

Finally, we facet the results by census division and add our 
labels. A useful adjustment is to reorder the panels by the average
death rate. We put a minus in front of adjusted so that the
divisions with the highest average rates appear in the chart first. 

p3 + labs(x = "", y = "Rate per 100,000 population",
          title = "State-Level Opiate Death Rates by Census Division, 1999-2014") +
    facet_wrap(~ reorder(division_name, -adjusted, na.rm = TRUE), nrow = 3)
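The effect of the minus sign is easiest to see away from the plot. The sketch below applies reorder() to a small invented data frame; the toy object, its division labels, and its rates are assumptions made up for the example.

```r
# Three invented divisions with two observations each (one value is missing).
toy <- data.frame(
  division = factor(c("A", "A", "B", "B", "C", "C")),
  rate     = c(1, 3, 10, NA, 5, 7)
)

# reorder() sorts the factor levels by the mean of the second argument,
# from lowest to highest. Negating the rate therefore puts the division
# with the highest average rate first. na.rm = TRUE is passed on to mean().
by_rate <- reorder(toy$division, -toy$rate, na.rm = TRUE)
levels(by_rate)
```

Here the average rates are 2 for A, 10 for B, and 6 for C, so the levels come out as B, C, A. When a factor ordered this way is used in facet_wrap(), the panels are laid out in that order.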

Our new plot (fig. 7.18) brings out much of the overall story 
that is in the maps but also shifts the emphasis a bit. It is easier to 
see more clearly what is happening in some parts of the country. 

In particular you can see the climbing numbers in New Hampshire,
Rhode Island, Massachusetts, and Connecticut. You can
more easily see the state-level differences in the West, for instance
between Arizona, on the one hand, and New Mexico or Utah on
the other. And as was also visible on the maps, the astonishingly
rapid rise in West Virginia's death rate is also evident. Finally, the
time-series plots are better at conveying the diverging trajectories
of various states within regions. There is more variance at the
end of the series than at the beginning, especially in the Northeast,
Midwest, and South, and while this can be inferred from the
maps, it is easier to see in the trend plots.

The unit of observation in this graph is still the state-year.
The geographically bound nature of the data never goes away. The
lines we draw still represent states. Thus the basic arbitrariness of
the representation cannot be made to disappear. In some sense,
an ideal dataset here would be collected at some much more
fine-grained level of unit, time, and spatial specificity. Imagine
individual-level data with arbitrarily precise information on personal
characteristics, time, and location of death. In a case like
that, we could then aggregate up to any categorical, spatial, or
temporal units we liked. But data like that is extremely rare, often
for good reasons that range from practicality of collection to the
privacy of individuals. In practice we need to take care not to commit
a kind of fallacy of misplaced concreteness that mistakes the
unit of observation for the thing of real substantive or theoretical
interest. This is a problem for most kinds of social-scientific
data. But their striking visual character makes maps perhaps more
vulnerable to this problem than other kinds of visualization.

Figure 7.18: The opiate data as a faceted time-series.


7.6 Where to Go Next

In this chapter, we learned how to begin to work with state-level
and county-level data organized by FIPS codes. But this barely
scratches the surface of visualization where spatial features and
distributions are the main focus. The analysis and visualization of
spatial data is its own research area, with its own research disciplines
in geography and cartography. Concepts and methods for
representing spatial features are both well developed and standardized.
Until recently, most of this functionality was accessible only
through dedicated Geographical Information Systems. Their mapping
and spatial analysis features were not well connected. Or, at
least, they were not conveniently connected to software oriented
to the analysis of tabular data.

This is changing fast. Brunsdon & Comber (2015) provide
an introduction to some of R's mapping capabilities. Meanwhile,
recently these tools have become much more accessible via
the tidyverse. Of particular interest to social scientists is Edzer
Pebesma's ongoing development of the sf package
(r-spatial.github.io/sf/; also see news and updates at r-spatial.org),
which implements the standard Simple Features data model for spatial
features in a tidyverse-friendly way. Relatedly, Kyle Walker and Bob Rudis's
tigris package (github.com/walkerke/tigris) allows for
(sf-library-compatible) access to the
U.S. Census Bureau's TIGER/Line shapefiles, which allow you to
map data for many different geographical, administrative, and
census-related subdivisions of the United States, as well as things
like roads and water features. Finally, Kyle Walker's tidycensus
package (Walker 2018; walkerke.github.io/tidycensus) makes it much
easier to tidily get both
substantive and spatial feature data from the U.S. Census and the
American Community Survey.



8 Refine Your Plots 


So far we have mostly used ggplot's default output when making
our plots, generally not looking at opportunities to tweak or
customize things to any great extent. In general, when making
figures during exploratory data analysis, the default settings in
ggplot should be pretty good to work with. It's only when we have
some specific plot in mind that the question of polishing the results
comes up. Refining a plot can mean several things. We might want
to get the look of it just right, based on our own tastes and our
sense of what needs to be highlighted. We might want to format
it in a way that will meet the expectations of a journal, a conference
audience, or the general public. We might want to tweak this
or that feature of the plot or add an annotation or additional detail
not covered by the default output. Or we might want to completely
change the look of the entire thing, given that all the structural elements
of the plot are in place. We have the resources in ggplot to
do all these things.

Let’s begin by looking at a new dataset, asasec. This is some 
data on membership over time in special-interest sections of the 
American Sociological Association. 


head(asasec)

##                                Section         Sname
## 1      Aging and the Life Course (018)         Aging
## 2     Alcohol, Drugs and Tobacco (030) Alcohol/Drugs
## 3 Altruism and Social Solidarity (047)      Altruism
## 4            Animals and Society (042)       Animals
## 5             Asia/Asian America (024)          Asia
## 6            Body and Embodiment (048)          Body
##   Beginning Revenues Expenses Ending Journal Year Members
## 1     12752    12104    12007  12849      No 2005     598
## 2     11933     1144      400  12677      No 2005     301
## 3      1139     1862     1875   1126      No 2005      NA
## 4       473      820     1116    177      No 2005     209
## 5      9056     2116     1710   9462      No 2005     365
## 6      3408     1618     1920   3106      No 2005      NA


In this dataset, we have membership data for each section over 
a ten-year period, but the data on section reserves and income 
(the Beginning and Revenues variables) is for 2015 only. Let's
look at the relationship between section membership and section 
revenues for a single year, 2014. 



Figure 8.1: Back to basics.


p <- ggplot(data = subset(asasec, Year == 2014), mapping = aes(x = Members,
    y = Revenues, label = Sname))

p + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Figure 8.1 is our basic scatterplot-and-smoother graph. To 
refine it, let’s begin by identifying some outliers, switch from loess 
to OLS, and introduce a third variable. This gets us to Figure 8.2. 

p <- ggplot(data = subset(asasec, Year == 2014), mapping = aes(x = Members,
y = Revenues, label = Sname)) 

p + geom_point(mapping = aes(color = Journal)) + geom_smooth(method = "lm") 


Figure 8.2: Refining the plot.


Now we can add some text labels. At this point it makes sense 
to use some intermediate objects to build things up as we go. We 
won’t show them all. But by now you should be able to see in your 
mind’s eye what an object like p1 or p2 will look like. And of course
you should type out the code and check if you are right as you go. 

p0 <- ggplot(data = subset(asasec, Year == 2014), mapping = aes(x = Members,
    y = Revenues, label = Sname))

p1 <- p0 + geom_smooth(method = "lm", se = FALSE, color = "gray80") +
    geom_point(mapping = aes(color = Journal))

p2 <- p1 + geom_text_repel(data = subset(asasec, Year == 2014 &
    Revenues > 7000), size = 2)


Continuing with the p2 object, we can label the axes and scales. 
We also add a title and move the legend to make better use of the 
space in the plot (fig. 8.3). 
Figure 8.3: Refining the axes.


p3 <- p2 + labs(x = "Membership",
                y = "Revenues",
                color = "Section has own Journal",
                title = "ASA Sections",
                subtitle = "2014 Calendar year.",
                caption = "Source: ASA annual report.")

p4 <- p3 + scale_y_continuous(labels = scales::dollar) +
    theme(legend.position = "bottom")


8.1 Use Color to Your Advantage 

You should choose a color palette in the first place based on its
ability to express the data you are plotting. An unordered categorical
variable like “country” or “sex,” for example, requires
distinct colors that won’t be easily confused with one another. An
ordered categorical variable like “level of education,” on the other
hand, requires a graded color scheme of some kind running from
less to more or earlier to later. There are other considerations,
too. For example, if your variable is ordered, is your scale centered
on a neutral midpoint with departures to extremes in each
direction, as in a Likert scale? Again, these questions are about
ensuring accuracy and fidelity when mapping a variable to a color
scale. Take care to choose a palette that reflects the structure of
your data. For example, do not map sequential scales to categorical
palettes, or use a diverging palette for a variable with no
well-defined midpoint.

Figure 8.4: RColorBrewer's sequential palettes.

Figure 8.5: RColorBrewer's diverging palettes.

Figure 8.6: RColorBrewer's qualitative palettes.

Separate from these mapping issues, there are considerations 
about which colors in particular to choose. In general, the default 
color palettes that ggplot makes available are well chosen for their 
perceptual properties and aesthetic qualities. We can also use color 
and color layers as a device for emphasis, to highlight particular data
points or parts of the plot, perhaps in conjunction with other 
features. 

We choose color palettes for mappings through one of the
scale_ functions for color or fill. While it is possible to
very finely control the look of your color schemes by varying
the hue, chroma, and luminance of each color you use
via scale_color_hue() or scale_fill_hue(), in general this is
not recommended. Instead you should use the RColorBrewer
package to make a wide range of named color palettes available
to you, and choose from those. Figures 8.4, 8.5, and 8.6
show the available options for sequential, diverging, and qualitative
variables. When used in conjunction with ggplot, you
access these colors by specifying the scale_color_brewer() or
scale_fill_brewer() functions, depending on the aesthetic you
are mapping. Figure 8.7 shows how to use the named palettes in
this way.


p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors, 
color = world)) 

p + geom_point(size = 2) + scale_color_brewer(palette = "Set2") + 
theme(legend.position = "top") 

p + geom_point(size = 2) + scale_color_brewer(palette = "Pastel2") + 
theme(legend.position = "top") 

p + geom_point(size = 2) + scale_color_brewer(palette = "Dark2") + 
theme(legend.position = "top") 
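If you are unsure which kind of variable a named palette is intended for, the RColorBrewer package can tell you. The following is a short sketch, assuming the package is installed; it looks up three palettes in the package's brewer.pal.info table and extracts the hex values from one of them.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per named palette.
# Its category column is "seq", "div", or "qual".
brewer.pal.info[c("Blues", "RdBu", "Set2"), ]

# brewer.pal() returns the colors of a named palette as hex strings.
set2 <- brewer.pal(5, "Set2")
set2
```

A palette whose category is "qual" suits an unordered categorical variable, "seq" suits an ordered one, and "div" suits a scale with a meaningful midpoint.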


You can also specify colors manually, via scale_color_manual()
or scale_fill_manual(). These functions take a values
argument that can be specified as a vector of color names or color
values that R knows about. R knows many color names (like red,
and green, and cornflowerblue). Try demo('colors') for an
overview. Alternatively, color values can be specified via their hexadecimal
RGB value. This is a way of encoding color values in the
RGB colorspace, where each channel can take a value from 0 to
255. A color hex value begins with a hash or pound character, #,
followed by three pairs of hexadecimal or “hex” numbers. Hex values
are in Base 16, with the first six letters of the alphabet standing
for the numbers 10 to 15. This allows a two-character hex number
to range from 0 to 255. You read them as #rrggbb, where rr is the
two-digit hex code for the red channel, gg for the green channel,
and bb for the blue channel. So #CC55DD translates in decimal to
CC = 204 (red), 55 = 85 (green), and DD = 221 (blue). It gives a
strong pink color.
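The arithmetic can be checked directly in base R. This is a small sketch rather than anything you need for plotting; the object names hex and channels are ours.

```r
hex <- "#CC55DD"

# Split the string into its three two-character channel codes.
channels <- substring(hex, c(2, 4, 6), c(3, 5, 7))
channels

# Convert each pair from base 16 to a decimal number.
strtoi(channels, base = 16L)

# col2rgb() does the same conversion for any color R knows about,
# and rgb() goes in the other direction.
col2rgb(hex)
rgb(204, 85, 221, maxColorValue = 255)
```

The strtoi() call returns 204, 85, and 221, matching the red, green, and blue values worked out by hand above.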

Going back to our ASA membership plot, for example, for 
figure 8.8 we can manually introduce a palette from Chang (2013) 
that’s friendly to viewers who are color blind. 


cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
                "#0072B2", "#D55E00", "#CC79A7")

p4 + scale_color_manual(values = cb_palette) 


While we can specify colors manually, this work has already 
been done for us. If we are serious about using a safe palette for 
color-blind viewers, we should investigate the dichromat package 
instead. It provides a range of palettes and some useful functions 
for helping you see approximately what your current palette might 
look like to a viewer with one of several different kinds of color 
blindness. 

For example, let's use RColorBrewer's brewer.pal() function
to get five colors from ggplot's default palette.

Default <- brewer.pal(5, "Set2")

Next, we can use a function from the dichromat package to
transform these colors to new values that simulate different kinds
of color blindness. The results are shown in figure 8.9. (The
colorblindr package has similar functionality.)

Figure 8.7: Some available palettes in use.
Figure 8.8: Using a custom color palette.

Figure 8.9: Comparing a default color palette with
an approximation of how the same palette appears
to people with one of three kinds of color blindness.


library(dichromat)

types <- c("deutan", "protan", "tritan")
names(types) <- c("Deuteronopia", "Protanopia", "Tritanopia")

color_table <- types %>%
    purrr::map(~ dichromat(Default, .x)) %>%
    as_tibble() %>%
    add_column(Default, .before = TRUE)

color_table

## # A tibble: 5 x 4
##   Default Deuteronopia Protanopia Tritanopia
##   <chr>   <chr>        <chr>      <chr>
## 1 #66C2A5 #AEAEA7      #BABAA5    #82BDBD
## 2 #FC8D62 #B6B661      #9E9E63    #F29494
## 3 #8DA0CB #9C9CCB      #9E9ECB    #92ABAB
## 4 #E78AC3 #ACACC1      #9898C3    #DA9C9C
## 5 #A6D854 #CACA5E      #D3D355    #B6C8C8

color_comp(color_table)


In this code, we create a vector of types of color blindness
that the dichromat() function knows about and give them proper
names. Then we make a table of colors for each type using the
purrr library’s map() function. The rest of the pipeline converts
the results from a list to a tibble and adds the original colors as the
first column in the table. We can now plot them to see how they
compare, using a convenience function from the socviz library.

The ability to manually specify colors can be useful when the
meaning of a category itself has a strong color association. Political
parties, for example, tend to have official or quasi-official party
colors that people associate with them. In such cases it is helpful to
be able to present results for, say, the Green Party in a green color.
When doing this, it is worth keeping in mind that some colors are
associated with categories (especially categories of person) for outmoded
reasons, or no very good reason. Do not use stereotypical
colors just because you can.


8.2 Layer Color and Text Together 


Aside from mapping variables directly, color is also useful when 
we want to pick out or highlight some aspect of our data. In 
cases like this, the layered approach of ggplot can really work 
to our advantage. Let’s work through an example where we use 
manually specified colors both for emphasis and because of their 
social meaning. 

We will build up a plot of data about the U.S. general election
in 2016. It is contained in the county_data object in the
socviz library. We begin by defining a blue and red color for the 
Democrats and Republicans, respectively. Then we create the basic 
setup and first layer of the plot. We subset the data, including 
only counties with a value of “No” on the flipped variable. We 
set the color of geom_point() to be a light gray, as it will form 
the background layer of the plot (fig. 8.10). And we apply a log 
transformation to the x-axis scale. 



Figure 8.10: The background layer. 


# Democrat Blue and Republican Red
party_colors <- c("#2E74C0", "#CB454A")

p0 <- ggplot(data = subset(county_data,
                           flipped == "No"),
             mapping = aes(x = pop,
                           y = black/100))

p1 <- p0 + geom_point(alpha = 0.15, color = "gray50") +
    scale_x_log10(labels = scales::comma)

p1
Figure 8.12: Adding guides and labels, and fixing
the y-axis scale.


In the next step (fig. 8.11) we add a second geom_point()
layer. Here we start with the same dataset but extract a complementary
subset from it. This time we choose the “Yes” counties
on the flipped variable. The x and y mappings are the same, but
we add a color scale for these points, mapping the partywinner16
variable to the color aesthetic. Then we specify a manual color
scale with scale_color_manual(), where the values are the blue
and red party_colors we defined above.


p2 <- p1 + geom_point(data = subset(county_data,
                                    flipped == "Yes"),
                      mapping = aes(x = pop, y = black/100,
                                    color = partywinner16)) +
    scale_color_manual(values = party_colors)

p2
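An unnamed vector of colors is matched to the categories in the order of their levels. If you would rather not depend on that ordering, scale_color_manual() also accepts a named vector, which ties each color to a category explicitly. The sketch below uses a small invented data frame; the toy object, the winner column, and the named_colors vector are assumptions for illustration, with the category labels taken from the legend of the plot above.

```r
library(ggplot2)

# Name each color after the category it should represent.
named_colors <- c(Democrat = "#2E74C0", Republican = "#CB454A")

# Four made-up counties.
toy <- data.frame(
  pop    = c(1e3, 1e4, 1e5, 1e6),
  black  = c(5, 40, 12, 30),
  winner = c("Republican", "Democrat", "Republican", "Democrat")
)

p_named <- ggplot(data = toy,
                  mapping = aes(x = pop, y = black/100, color = winner)) +
  geom_point(size = 2) +
  scale_x_log10(labels = scales::comma) +
  scale_color_manual(values = named_colors)

# The color each point will actually be drawn in.
layer_data(p_named)$colour
```

With named values the assignment stays correct even if the levels of the variable are later reordered, or if one of the categories is absent from a particular subset.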


The next layer sets the y-axis scale and the labels (fig. 8.12). 


p3 <- p2 + scale_y_continuous(labels=scales::percent) +
labs(color = "County flipped to ... ", 

x = "County Population (log scale)", 
y = "Percent Black Population", 
title = "Flipped counties, 2016", 
caption = "Counties in gray did not flip.") 

p3 


Finally, we add a third layer using the geom_text_repel() 
function. Once again we supply a set of instructions to subset the 
data for this text layer. We are interested in the flipped counties 
that have a relatively high percentage of African American residents.
The result, shown in figure 8.13, is a complex but legible
multilayer plot with judicious use of color for variable coding and 
context. 


p4 <- p3 + geom_text_repel(data = subset(county_data,
                                         flipped == "Yes" &
                                         black > 25),
                           mapping = aes(x = pop,
                                         y = black/100,
                                         label = state), size = 2)

p4 + theme_minimal() +
    theme(legend.position = "top")


Figure 8.13: County-level election data from 2016.


When producing a graphic like this in ggplot, or when looking
at good plots made by others, it should gradually become your
habit to see not just the content of the plot but also the implicit
or explicit structure that it has. First, you will be able to see the
mappings that form the basis of the plot, picking out which variables
are mapped to x and y, and which to color, fill, shape,
label, and so on. What geoms were used to produce them? Second,
how have the scales been adjusted? Are the axes transformed?
Are the fill and color legends combined? And third, especially as
you practice making plots of your own, you will find yourself picking
out the layered structure of the plot. What is the base layer?
What has been drawn on top of it, and in what order? Which
upper layers are formed from subsets of the data? Which are new
datasets? Are there annotations? The ability to evaluate plots in
this way, to apply the grammar of graphics in practice, is useful
both for looking at plots and for thinking about how to make
them.


8.3 Change the Appearance of Plots with Themes 

Our elections plot is in a pretty finished state. But if we want to 
change the overall look of it all at once, we can do that using 
ggplot’s theme engine. Themes can be turned on or off using the 
theme_set() function. It takes the name of a theme (which will 
itself be a function) as an argument. Try the following: 

theme_set(theme_bw()) 

p4 + theme(legend.position = "top") 

theme_set(theme_dark()) 

p4 + theme(legend.position = "top") 

Internally, theme functions are a set of detailed instructions
to turn on, turn off, or modify a large number of graphical elements
on the plot. Once set, a theme applies to all subsequent
plots, and it remains active until it is replaced by a different theme.
This can be done either through the use of another theme_set()
statement or on a per plot basis by adding the theme function to
the end of the plot: p4 + theme_gray() would temporarily override
the generally active theme for the p4 object only. You can
still use the theme() function to fine-tune any aspect of your plot,
as seen above with the relocation of the legend to the top of the
graph.
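Because a theme set with theme_set() stays active for the rest of the session, it is handy to know that the function invisibly returns the theme it replaced. The following sketch saves that return value and uses it to restore the earlier theme. The plot uses the mpg data that ships with ggplot2, and the object names old_theme, p_demo, and p_gray are ours.

```r
library(ggplot2)

# theme_set() invisibly returns the theme that was active before the call.
old_theme <- theme_set(theme_bw())

# Any plot made now uses theme_bw() when it is drawn.
p_demo <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Adding a theme function overrides the active theme for this plot only.
p_gray <- p_demo + theme_gray()

# Put the previous theme back so that later plots are unaffected.
theme_set(old_theme)
```

After the final line, theme_get() reports the same theme that was in place before the sketch was run.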

The ggplot library comes with several built-in themes, including
theme_minimal() and theme_classic(), with theme_gray()
or theme_grey() as the default. If these are not to your taste,
install the ggthemes package for many more options. You can, for
example, make ggplot output look like it has been featured in the
Economist, or the Wall Street Journal, or in the pages of a book by
Edward Tufte. Figure 8.14 shows two examples.

Figure 8.14: Economist and WSJ themes.

Using some themes might involve adjusting font sizes or other 
elements as needed, if the defaults are too large or small. If you use 
a theme with a colored background, you will also need to consider 
what color palette you are using when mapping to color or fill 
aesthetics. You can define your own themes either entirely from 
scratch or by starting with one you like and making adjustments 
from there. 


library(ggthemes) 

theme_set(theme_economist()) 
p4 + theme(legend.position="top") 

theme_set(theme_wsj()) 

p4 + theme(plot.title = element_text(size = rel(0.6)), 

legend.title = element_text(size = rel(0.35)), 
plot.caption = element_text(size = rel(0.35)), 
legend.position = "top") 


Generally speaking, themes with colored backgrounds and
customized typefaces are best used when making one-off graphics
or posters, preparing figures to integrate into a slide presentation,
or conforming to a house or editorial style for publication. Take
care to consider how the choices you make will harmonize with
the broader printed or displayed material. Just as with the choice
of palettes for aesthetic mappings, when starting out it can be wisest
to stick to the defaults or consistently use a theme that has had
its kinks already ironed out. Claus O. Wilke's cowplot package,
for instance, contains a well-developed theme suitable for figures
whose final destination is a journal article. (It also contains some
convenience functions for laying out several plot objects in a single
figure, among other features, as we shall see below in one of the
case studies.) Bob Rudis’s hrbrthemes
package, meanwhile, has a distinctive and compact look and feel
that takes advantage of some freely available typefaces. Both are
available via install.packages().

The theme() function allows you to exert fine-grained control
over the appearance of all kinds of text and graphical elements
in a plot. For example, we can change the color, typeface, and
font size of text. If you have been following along writing your
code, you will have noticed that the plots you make have not been
identical to the ones shown in the text. The axis labels are in a
slightly different place from the default, the typeface is different,
and there are other, smaller changes as well. The theme_book()
function provides the custom ggplot theme used throughout this
book. The code for this theme is based substantially on Bob Rudis’s
theme_ipsum(), from his hrbrthemes library. You can learn more
about it in the appendix. For figure 8.15, we then adjust that theme
even further by tweaking the text size, and we also remove a number
of elements by naming them and making them disappear using
element_blank().


p4 + theme(legend.position = "top")

p4 + theme(legend.position = "top",
           plot.title = element_text(size = rel(2),
                                     lineheight = .5,
                                     family = "Times",
                                     face = "bold.italic",
                                     colour = "orange"),
           axis.text.x = element_text(size = rel(1.1),
                                      family = "Courier",
                                      face = "bold",
                                      color = "purple"))
Figure 8.15: Controlling various theme elements directly (and making several bad choices while doing so).


8.4 Use Theme Elements in a Substantive Way 

It makes good sense to use themes as a way to fix design elements
because that means you can subsequently ignore them and focus
instead on the data you are examining. But it is also worth remembering
that ggplot’s theme system is very flexible. It permits a wide
range of design elements to be adjusted to create custom figures.
For instance, following an example from Wehrwein (2017), we will
create an effective small multiple of the age distribution of GSS
respondents over the years. The gss_lon data contains information
on the age of each GSS respondent for all the years in the
survey since 1972. The base figure 8.16 is a scaled geom_density()
layer of the sort we saw earlier, this time faceted by the year variable.
We will fill the density curves with a dark gray color and
then add an indicator of the mean age in each year, and a text
layer for the label. With those in place we then adjust the detail
of several theme elements, mostly to remove them. As before, we
use element_text() to tweak the appearance of various text elements
such as titles and labels. And we also use element_blank()
to remove several of them altogether.

First, we need to calculate the mean age of the respondents for
each year of interest. Because the GSS has been around for most
(but not all) years since 1972, we will look at distributions about
every four years since the beginning. We use a short pipeline to
extract the average ages.

Figure 8.16: A customized small multiple.

yrs <- c(seq(1972, 1988, 4), 1993, seq(1996, 2016, 4))

mean_age <- gss_lon %>%
    filter(age %nin% NA & year %in% yrs) %>%
    group_by(year) %>%
    summarize(xbar = round(mean(age, na.rm = TRUE), 0))
mean_age$y <- 0.3

yr_labs <- data.frame(x = 85, y = 0.8,
                      year = yrs)
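To make the result of this pipeline concrete, here is the same sequence of steps run on a few invented rows. The toy_gss data frame and the shortened toy_yrs vector are assumptions made up for the example, and !is.na(age) is used as a base R stand-in for the %nin% NA test, so that the sketch needs only dplyr.

```r
library(dplyr)

# The years selected in the text: every fourth year, with 1993 standing in
# where the four-year pattern is broken.
yrs <- c(seq(1972, 1988, 4), 1993, seq(1996, 2016, 4))
yrs

# A few made-up respondents.
toy_gss <- tibble(
  year = c(1972, 1972, 1976, 1976, 1980),
  age  = c(30, 42, NA, 50, 62)
)
toy_yrs <- c(1972, 1976)

# Drop missing ages, keep the chosen years, and average within each year.
toy_mean_age <- toy_gss %>%
  filter(!is.na(age) & year %in% toy_yrs) %>%
  group_by(year) %>%
  summarize(xbar = round(mean(age), 0))

# A constant y column, used later to position the text labels.
toy_mean_age$y <- 0.3
toy_mean_age
```

The summary has one row per selected year, which is what lets geom_vline() and geom_text() draw a single line and a single label in each facet.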


The y column in mean_age will come in handy when we want 
to position the age as a text label. Next, we prepare the data and set 
up the geoms. 

p <- ggplot(data = subset(gss_lon, year %in% yrs),
            mapping = aes(x = age))

p1 <- p + geom_density(fill = "gray20", color = NA,
                       alpha = 0.9, mapping = aes(y = ..scaled..)) +
    geom_vline(data = subset(mean_age, year %in% yrs),
               aes(xintercept = xbar), color = "white", size = 0.5) +
    geom_text(data = subset(mean_age, year %in% yrs),
              aes(x = xbar, y = y, label = xbar), nudge_x = 7.5,
              color = "white", size = 3.5, hjust = 1) +
    geom_text(data = subset(yr_labs, year %in% yrs),
              aes(x = x, y = y, label = year)) +
    facet_grid(year ~ ., switch = "y")


The initial p object subsets the data by the years we have chosen
and maps x to the age variable. The geom_density() call is
the base layer, with arguments to turn off its default line color, set 
the fill to a shade of gray, and scale the y-axis between zero and 
one. 

Using our summarized dataset, the geom_vline() layer draws
a vertical white line at the mean age of the distribution. The first of
two text geoms labels the age line (in white). The first geom_text()
call uses a nudge argument to push the label slightly to the right
of its x-value. The second labels the year. We do this because
we are about to turn off the usual facet labels to make the plot
more compact. Finally we use facet_grid() to break out the age
distributions by year. We use the switch argument to move the
labels to the left.

With the structure of the plot in place, we then style the elements
in the way that we want, using a series of instructions to
theme().


p1 + theme_book(base_size = 10, plot_title_size = 10,
                strip_text_size = 32, panel_spacing = unit(0.1, "lines")) +
    theme(plot.title = element_text(size = 16),
          axis.text.x = element_text(size = 12),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          strip.background = element_blank(),
          strip.text.y = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank()) +
    labs(x = "Age",
         y = NULL,
         title = "Age Distribution of\nGSS Respondents")


One of the pleasing things about ggplot’s developer community
is that it often takes plot ideas that are first worked out
in a one-off or bespoke way and generalizes them to the point
where they are available as new geoms. Shortly after writing the
code for the GSS age distributions in figure 8.16, the ggridges
package was released. Written by Claus O. Wilke, it offers a different
take on small-multiple density plots by allowing the distributions
to overlap vertically to interesting effect. It is especially
useful for repeated distributional measures that change in
a clear direction. In figure 8.17 we redo our previous plot using a
function from ggridges. Because geom_density_ridges() makes
for a more compact display we trade off showing the mean age
value for the sake of displaying the distribution for every GSS
year.
library(ggridges)

p <- ggplot(data = gss_lon,
            mapping = aes(x = age, y = factor(year, levels = rev(unique(year)),
                                              ordered = TRUE)))

p + geom_density_ridges(alpha = 0.6, fill = "lightblue", scale = 1.5) +
    scale_x_continuous(breaks = c(25, 50, 75)) +
    scale_y_discrete(expand = c(0.01, 0)) +
    labs(x = "Age", y = NULL,
         title = "Age Distribution of\nGSS Respondents") +
    theme_ridges() +
    theme(title = element_text(size = 16, face = "bold"))

The expand argument in scale_y_discrete() adjusts the 
scaling of the y-axis slightly. It has the effect of shortening the distance
between the axis labels and the first distribution, and it also
prevents the top of the very first distribution from being cut off 
by the frame of the plot. The package also comes with its own 
theme, theme_ridges(), that adjusts the labels so that they are 
aligned properly, and we use it here. The geom_density_ridges() 
function is also capable of reproducing the look of our original 
version. The degree of overlap in the distributions is controlled by 
the scale argument in the geom. You can experiment with setting 
it to values below or above one to see the effects on the layout of 
the plot. 
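
To see what the scale argument does, it may help to draw the same ridges twice with different values. The sketch below uses R's built-in iris data purely as a stand-in for gss_lon, and the object names are made up for illustration.

```r
library(ggplot2)
library(ggridges)

# iris ships with R; it stands in here for gss_lon.
p <- ggplot(data = iris,
            mapping = aes(x = Sepal.Length, y = Species))

# scale below 1: each ridge stays within its own row, so nothing overlaps.
p_separate <- p + geom_density_ridges(scale = 0.9, alpha = 0.6)

# scale above 1: ridges are allowed to rise into the rows above them.
p_overlap <- p + geom_density_ridges(scale = 3, alpha = 0.6)

p_separate
p_overlap
```

Comparing the two plots side by side should make it easier to pick a value that keeps the display compact without hiding the lower distributions.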

Much more detailed information on the names of the various
elements you can control via theme() can be found in the
ggplot documentation. Setting these thematic elements in an ad 
hoc way is often one of the first things people want to do when 
they make a plot. But in practice, apart from getting the overall 
size and scale of your plot squared away, making small adjustments
to theme elements should be the last thing you do in the
plotting process. Ideally, once you have set up a theme that works 
well for you, it should be something you can avoid having to do 
at all. 
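
One way to avoid repeated ad hoc adjustments is to bundle them into a theme object once and make it the session default with theme_set(). What follows is a minimal sketch, in which my_theme and the particular settings are placeholders rather than recommendations.

```r
library(ggplot2)

# Start from a complete built-in theme and override a few elements.
my_theme <- theme_minimal(base_size = 12) +
    theme(legend.position = "top",
          plot.title = element_text(face = "bold"))

# theme_set() returns the previous default, so it can be restored later.
old_theme <- theme_set(my_theme)

# Plots built from here on pick up my_theme without further theme() calls.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    labs(title = "Uses the default theme set above")
p

# Restore the earlier default when done.
theme_set(old_theme)
```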


Figure 8.17: A ridgeplot version of the age
distribution plot.








8.5 Case Studies 


Bad graphics are everywhere. Better ones are within our reach. For 
the final sections of this chapter we will work through some common
visualization problems or dilemmas, as seen through some
real-life cases. In each case we will look at the original figures and 
redraw them in new (and better) versions. In the process we will 
introduce a few new functions and features of ggplot that we have 
not seen yet. This, too, is true to life. Usually, it’s having to face 
some practical design or visualization question that forces us to 
ferret out the solution to our problem in the documentation or 
come up with some alternative answer on the fly ourselves. Let’s 
start with a common case: the use of dual axes in trend plots. 


Two y-axes 

In January 2016 Liz Ann Sonders, chief investment strategist 
with Charles Schwab, Inc., tweeted about the apparent correlation 
between two economic time series: the Standard and Poor’s 500 
stock market index and the Monetary Base, a measure of the size of 
money supply. The S&P is an index that ranges from about 700 to 
about 2,100 over the period of interest (about the last seven years). 
The Monetary Base ranges from about 1.5 trillion to 4.1 trillion 
dollars over the same period. This means that we can’t plot the two 
series directly. The Monetary Base is so much larger that it would 
make the S&P 500 series appear as a flat line at the bottom. While 
there are several reasonable ways to address this, people often opt 
instead to have two y-axes. 

Because it is designed by responsible people, R makes it 
slightly tricky to draw graphs with two y-axes. In fact, ggplot rules 
it out of order altogether. It is possible to do it using R’s base graphics,
if you insist. Figure 8.18 shows the result. (You can find the
code at https://github.com/kjhealy/two-y-axes.) Graphics in
base R work very differently from the approach we have taken 
throughout this book, so it would just be confusing to show the 
code here. 

Figure 8.18: Two time series, each with its own
y-axis.

Figure 8.19: Variations on two y-axes.

Most of the time when people draw plots with two y-axes they
want to line the series up as closely as possible because they suspect
that there’s a substantive association between them, as in this
case. The main problem with using two y-axes is that it makes it
even easier than usual to fool yourself (or someone else) about the 
degree of association between the variables. This is because you 
can adjust the scaling of the axes relative to one another in a way 
that moves the data series around more or less however you like. In 
figure 8.18 the red Monetary Base line tracks below the blue S&P 
500 for the first half of the graph and is above it for the second half. 

We can “fix” that by deciding to start the second y-axis at zero, 
which shifts the Monetary Base line above the S&P line for the first 
half of the series and below it later on. The first panel in figure 8.19 
shows the results. The second panel, meanwhile, adjusts the axes 
so that the axis tracking the S&P starts at zero. The axis tracking 
the Monetary Base starts around its minimum (as is generally good 
practice), but now both axes max out around 4,000. The units are 
different, of course. The 4,000 on the S&P side is an index number, 
while the Monetary Base number is 4,000 billion dollars. The effect 
is to flatten out the S&P’s apparent growth quite a bit, muting the 
association between the two variables substantially. You could tell 
quite a different story with this one, if you felt like it. 

How else might we draw this data? We could use a split- or 
broken-axis plot to show the two series at the same time. These can 
be effective sometimes, and they seem to have better perceptual 
properties than overlaid charts with dual axes (Isenberg et al.
2011). They are most useful in cases where the series you are plotting
are of the same kind but of very different magnitudes. That is
not the case here. 







Another compromise, if the series are not in the same units 
(or of widely differing magnitudes), is to rescale one of the series 
(e.g., by dividing or multiplying it by a thousand), or alternatively 
to index each of them to 100 at the start of the first period and then 
plot them both. Index numbers can have complications of their 
own, but here they allow us to use one axis instead of two, and also 
to calculate a sensible difference between the two series and plot 
that as well, in a panel below. It can be tricky to visually estimate 
the difference between series, in part because our perceptual tendency
is to look for the nearest comparison point in the other series
rather than the one directly above or below. Following Cleveland
(1994), we can also add a panel underneath that tracks the running
difference between the two series. We begin by making each
plot and storing it in an object. To do this, it will be convenient
to tidy the data into a long format, with the indexed series in the
key variable and their corresponding scores as the values. We use
tidyr’s gather() function for this:

head(fredts)

##         date  sp500 monbase sp500_i monbase_i
## 1 2009-03-11 696.68 1542228 100.000   100.000
## 2 2009-03-18 766.73 1693133 110.055   109.785
## 3 2009-03-25 799.10 1693133 114.701   109.785
## 4 2009-04-01 809.06 1733017 116.131   112.371
## 5 2009-04-08 830.61 1733017 119.224   112.371
## 6 2009-04-15 852.21 1789878 122.324   116.058

fredts_m <- fredts %>% select(date, sp500_i, monbase_i) %>%
    gather(key = series, value = score, sp500_i:monbase_i)

head(fredts_m)

##         date  series   score
## 1 2009-03-11 sp500_i 100.000
## 2 2009-03-18 sp500_i 110.055
## 3 2009-03-25 sp500_i 114.701
## 4 2009-04-01 sp500_i 116.131
## 5 2009-04-08 sp500_i 119.224
## 6 2009-04-15 sp500_i 122.324
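
The indexed columns in fredts can be reproduced by dividing each series by its first value and multiplying by 100. The sketch below rebuilds the six printed rows by hand (fredts_head and fredts_long are illustrative names, not objects from the text) and reshapes them with pivot_longer(), which current versions of tidyr recommend in place of gather().

```r
library(dplyr)
library(tidyr)

# The first six rows of fredts, typed in from the printed output.
fredts_head <- data.frame(
    date = as.Date(c("2009-03-11", "2009-03-18", "2009-03-25",
                     "2009-04-01", "2009-04-08", "2009-04-15")),
    sp500 = c(696.68, 766.73, 799.10, 809.06, 830.61, 852.21),
    monbase = c(1542228, 1693133, 1693133, 1733017, 1733017, 1789878)
)

# Index each series to 100 at the first observation.
fredts_head <- fredts_head %>%
    mutate(sp500_i = 100 * sp500 / first(sp500),
           monbase_i = 100 * monbase / first(monbase))

# Long format: one row per date and series.
fredts_long <- fredts_head %>%
    select(date, sp500_i, monbase_i) %>%
    pivot_longer(cols = c(sp500_i, monbase_i),
                 names_to = "series", values_to = "score")

head(fredts_long)
```

Unlike gather(), pivot_longer() returns the rows ordered by date rather than stacked by series. That makes no difference to a plot that maps series to color.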

Once the data is tidied in this way we can make our graph. 

p <- ggplot(data = fredts_m,
            mapping = aes(x = date, y = score,
                          group = series,
                          color = series))

p1 <- p + geom_line() + theme(legend.position = "top") +
    labs(x = "Date",
         y = "Index",
         color = "Series")

p <- ggplot(data = fredts,
            mapping = aes(x = date, y = sp500_i - monbase_i))

p2 <- p + geom_line() +
    labs(x = "Date",
         y = "Difference")

Now that we have our two plots, we want to lay them out nicely. 
We do not want them to appear in the same plot area, but we do 
want to compare them. It would be possible to do this with a facet, 
but that would mean doing a fair amount of data munging to get 
all three series (the two indices and the difference between them) 
into the same tidy data frame. An alternative is to make two separate
plots and then arrange them just as we like. For instance, have
the comparison of the two series take up most of the space, and 
put the plot of the index differences along the bottom in a smaller 
area. 

The layout engine used by R and ggplot, called grid, does 
make this possible. It controls the layout and positioning of plot 
areas and objects at a lower level than ggplot. Programming 
grid layouts directly takes a little more work than using ggplot’s 
functions alone. Fortunately, there are some helper libraries that 
we can use to make things easier. One possibility is to use the 
gridExtra library. It provides a number of useful functions that let 
us talk to the grid engine, including grid.arrange(). This function
takes a list of plot objects and instructions for how we would
like them arranged. The cowplot library we mentioned earlier
makes things even easier. It has a plot_grid() function that works
much like grid.arrange() while also taking care of some fine
details, including the proper alignment of axes across separate plot 
objects. 

cowplot::plot_grid(p1, p2, nrow = 2, rel_heights = c(0.75, 0.25),
                   align = "v")

Figure 8.20: Indexed series with a running
difference below, using two separate plots.


The result is shown in figure 8.20. It looks pretty good. In this 
version, the S&P index runs above the Monetary Base for almost 
the whole series, whereas in the plot as originally drawn, they 
crossed. 
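
For comparison, the gridExtra route mentioned earlier might look like the following sketch. It uses arrangeGrob(), the function that grid.arrange() calls to build the layout, so that the arrangement can be stored before it is drawn. The two stand-in plots are built from mtcars only so that the example runs on its own; in the chapter they would be the two stored plot objects.

```r
library(ggplot2)
library(gridExtra)
library(grid)

# Stand-ins for the two stored plots.
top_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
bottom_plot <- ggplot(mtcars, aes(x = wt, y = qsec)) + geom_line()

# Two rows, with the top plot given three quarters of the height.
two_rows <- arrangeGrob(top_plot, bottom_plot, nrow = 2,
                        heights = c(0.75, 0.25))

# Draw the stored arrangement on a fresh page.
grid.newpage()
grid.draw(two_rows)
```

This arrangement does not by itself line up the panel edges of the two plots, which is one of the details plot_grid() handles through align = "v".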

The broader problem with dual-axis plots of this sort is 
that the apparent association between these variables is probably 
spurious. The original plot is enabling our desire to spot patterns, 
but it is probably the case that both of these time series are tending
to increase but are not otherwise related in any deep way. If we
were interested in establishing the true association between them, 
we might begin by naively regressing one on the other. We can try 
to predict the S&P index from the Monetary Base, for instance. If 
we do that, things look absolutely fantastic to begin with, as we 
appear to explain about 95 percent of the variance in the S&P just 
by knowing the size of the Monetary Base from the same period. 
We’re going to be rich! 

Sadly, we’re probably not going to be rich. While everyone 
knows that correlation is not causation, with time-series data we 
get this problem twice over. Even just considering a single series, 
each observation is often pretty closely correlated with the observation
in the period immediately before it, or perhaps with the
observation some regular number of periods before it. A time 
series might have a seasonal component that we would want to 
account for before making claims about its growth, for example. 
And if we ask what predicts its growth, then we will introduce 
some other time series, which will have trend properties of its own. 
In those circumstances, we more or less automatically violate the 
assumptions of ordinary regression analysis in a way that produces 
wildly overconfident estimates of association. The result, which 
may seem paradoxical when you first run across it, is that a lot of 
the machinery of time-series analysis is about making the serial 
nature of the data go away. 
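
The point about trending series can be illustrated with simulated data. In the sketch below, two random walks with upward drift are generated independently, so any association between them is spurious by construction. Regressing one on the other in levels will typically give a high R-squared, while the same regression on period-to-period changes will not. The series, the drift value, and all object names are invented for illustration; they are not the S&P or Monetary Base data.

```r
set.seed(1234)
n <- 300

# Two independent random walks, each drifting upward by 0.5 per period.
series_a <- cumsum(rnorm(n, mean = 0.5, sd = 1))
series_b <- cumsum(rnorm(n, mean = 0.5, sd = 1))

# Naive regression in levels: the shared upward trend dominates.
fit_levels <- lm(series_a ~ series_b)
r2_levels <- summary(fit_levels)$r.squared

# First differences remove the trend and most of the serial dependence.
change_a <- diff(series_a)
change_b <- diff(series_b)
fit_changes <- lm(change_a ~ change_b)
r2_changes <- summary(fit_changes)$r.squared

round(c(levels = r2_levels, changes = r2_changes), 3)
```

Differencing is only one of several ways of making the serial nature of the data go away, but it is enough to show how much of the apparent fit in levels comes from the trend alone.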

Like any rule of thumb, it is possible to come up with exceptions,
or talk oneself into them. We can imagine situations where
the judicious use of dual y-axes might be a sensible way to present 
data to others, or might help a researcher productively explore a 
dataset. But in general I recommend against it because it is already 
much too easy to present spurious, or at least overconfident, associations,
especially with time-series data. Scatterplots can do that
just fine. Even with a single series, as we saw in chapter 1, we can 
make associations look steeper or flatter by fiddling with the aspect 
ratio. Using two y-axes gives you an extra degree of freedom to 
mess about with the data that, in most cases, you really should not 
take advantage of. A rule like this will not stop people who want to 
fool you with charts from trying, of course. But it might help you 
not fool yourself. 


Redrawing a bad slide 

In late 2015 Marissa Mayer’s performance as CEO of Yahoo was 
being criticized by many observers. One of them, Eric Jackson, 
an investment fund manager, sent a ninety-nine-slide presentation
to Yahoo’s board outlining his best case against Mayer. (He
also circulated it publicly.) The style of the slides was typical of 
business presentations. Slides and posters are a very useful means 
of communication. In my experience, most people who complain
about “death by PowerPoint” have not sat through enough
talks where the presenter hasn’t even bothered to prepare slides. 
But it is striking to see how fully the “slide deck” has escaped
its origins as an aid to communication and metastasized into a
freestanding quasi-format of its own. Business, the military, and
academia have all been infected by this tendency in various ways.
Never mind taking the time to write a memo or an article, just
give us endless pages of bullet points and charts. The disorienting
effect is of constant summaries of discussions that never took
place.

Figure 8.21: A bad slide. (From the
December 2015 Yahoo! Investor
Presentation: A Better Plan for
Yahoo Shareholders.)
[Slide title: Yahoo full time employee headcount (2004-2015) vs. revenue (2004-2015).
Source note on slide: Company filings (10K), analyst calls.]

In any case, figure 8.21 reproduces a typical slide from the 
deck. It seems to want to say something about the relationship 
between Yahoo’s number of employees and its revenue, in the context
of Mayer’s tenure as CEO. The natural thing to do would be
to make some kind of scatterplot to see if there was a relationship 
between these variables. Instead, however, the slide puts time on 
the x-axis and uses two y-axes to show the employee and revenue 
data. It plots the revenues as a bar chart and the employee data 
as points connected by slightly wavy lines. At first glance, it is 
not clear whether the connecting line segments are just manually 
added or if there’s some principle underlying the wiggles. (They 
turn out to have been created in Excel.) The revenue values are 
used as labels within the bars. The points are not labeled. Employee 
data goes to 2015, but revenue data only to 2014. An arrow points 
to the date Mayer was hired as CEO, and a red dotted line seems 
to indicate ... actually I’m not sure. Maybe some sort of threshold 
below which employee numbers should fall? Or maybe just the last 
observed value, to allow comparison across the series? It isn’t clear. 
Finally, notice that while the revenue numbers are annual, there is 
more than one observation per year for some of the later employee 
numbers. 

How should we redraw this chart? Let’s focus on getting across
the relationship between employee numbers and revenue, as that
seems to be the motivation for it in the first place. As a secondary
element, we want to say something about Mayer’s role in this relationship.
The original sin of the slide is that it plots two series of
numbers using two different y-axes, as discussed above. We see 
this from business analysts more often than not. Time is almost 
the only thing they ever put on the x-axis. 

To redraw the chart, I took the numbers from the bars on 
the chart together with employee data from QZ.com. Where there 
was quarterly data in the slide, I used the end-of-year number for 
employees, except for 2012. Mayer was appointed in July 2012. 
Ideally we would have quarterly revenue and quarterly employee 
data for all years, but given that we do not, the most sensible 
thing to do is to keep things annualized except for the one year of 
interest, when Mayer arrives as CEO. It’s worth doing this because 
otherwise the large round of layoffs that immediately preceded her 
arrival would be misattributed to her tenure as CEO. The upshot 
is that we have two observations for 2012 in the dataset. They have 
the same revenue data but different employee numbers. The figures 
can be found in the yahoo dataset. 


head(yahoo)

##   Year Revenue Employees Mayer
## 1 2004    3574      7600    No
## 2 2005    5257      9800    No
## 3 2006    6425     11400    No
## 4 2007    6969     14300    No
## 5 2008    7208     13600    No
## 6 2009    6460     13900    No


The redrawing is straightforward. We could just draw a scatterplot
and color the points by whether Mayer was CEO at the
time. By now you should know how to do this quite easily. We can
take a small step further by making a scatterplot but also holding
on to the temporal element beloved of business analysts. We can
use geom_path() and use line segments to “join the dots” of the
yearly observations in order, labeling each point with its year. The
result (fig. 8.22) is a plot that shows the trajectory of the company
over time, like a snail moving across a flagstone. Again, bear in
mind that we have two observations for 2012.

p <- ggplot(data = yahoo,
            mapping = aes(x = Employees, y = Revenue))
p + geom_path(color = "gray80") +
    geom_text(aes(color = Mayer, label = Year),
              size = 3, fontface = "bold") +
    theme(legend.position = "bottom") +
    labs(color = "Mayer is CEO",
         x = "Employees", y = "Revenue (Millions)",
         title = "Yahoo Employees vs Revenues, 2004-2014") +
    scale_y_continuous(labels = scales::dollar) +
    scale_x_continuous(labels = scales::comma)


This way of looking at the data suggests that Mayer was 
appointed after a period of falling revenues and just following 
a very large round of layoffs, a fairly common pattern with the 
leadership of large firms. Since then, through either new hires or 
acquisitions, employee numbers have crept back up a little while 
revenues have continued to fall. This version more clearly conveys 
what the original slide was trying to get across. 
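
The connected scatterplot depends on geom_path() joining points in the order they appear in the data, whereas geom_line() joins them in order of the x variable. A small sketch with made-up numbers (the firm data frame is hypothetical) shows the difference.

```r
library(ggplot2)

# Hypothetical yearly observations; the rows are in time order.
firm <- data.frame(
    year = 2010:2014,
    employees = c(10, 14, 12, 9, 11),
    revenue = c(5, 7, 8, 6, 4)
)

p <- ggplot(data = firm,
            mapping = aes(x = employees, y = revenue, label = year))

# geom_path() follows row order, so the line traces the years in sequence.
p_path <- p + geom_path(color = "gray80") + geom_text()

# geom_line() sorts by x first, so the temporal sequence is lost.
p_line <- p + geom_line(color = "gray80") + geom_text()

p_path
p_line
```

If the rows were not already sorted by year, they would need to be arranged by year before plotting for geom_path() to trace the trajectory correctly.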

Alternatively, we can keep the analyst community happy by 
putting time back on the x-axis and plotting the ratio of revenue 
to employees on the y-axis. This gives us the linear time-trend 
back, only in a more sensible fashion (fig. 8.23). We begin the 
plot by using geom_vline() to add a vertical line marking Mayer’s 
accession to the CEO position. 


Figure 8.22: Redrawing as a connected scatterplot.


p <- ggplot(data = yahoo,
            mapping = aes(x = Year, y = Revenue/Employees))

p + geom_vline(xintercept = 2012) +
    geom_line(color = "gray60", size = 2) +
    annotate("text", x = 2013, y = 0.44,
             label = " Mayer becomes CEO", size = 2.5) +
    labs(x = "Year\n",
         y = "Revenue/Employees",
         title = "Yahoo Revenue to Employee Ratio, 2004-2014")

Figure 8.23: Plotting the ratio of revenue to
employees against time. (From the December 2015
Yahoo! Investor Presentation: A Better Plan for
Yahoo Shareholders.)

Figure 8.24: Data on the structure of U.S. student
debts as of 2016.
[Panel titles: Borrower distribution by outstanding balance, out of 44 million borrowers in 2016;
Debt distribution by outstanding balance, out of $1.3 trillion in 2016.]

Saying no to pie

For a third example, we turn to pie charts. Figure 8.24 shows a 
pair of charts from a New York Federal Reserve Bank briefing on 
the structure of debt in the United States (Chakrabarti et al. 2017). 
As we saw in chapter 1, the perceptual qualities of pie charts are 
not great. In a single pie chart, it is usually harder than it should 
be to estimate and compare the values shown, especially when 
there are more than a few wedges and when there are a number of 
wedges reasonably close in size. A Cleveland dotplot or a bar chart
is usually a much more straightforward way of comparing quantities.
When comparing the wedges between two pie charts, as in
this case, the task is made harder again as the viewer has to ping 
back and forth between the wedges of each pie and the vertically 
oriented legend underneath. 

There is an additional wrinkle in this case. The variable broken
down in each pie chart is not only categorical, it is also ordered
from low to high. The data describe the percent of all borrowers
and the percent of all balances divided up across the size of balances
owed, from less than five thousand dollars to more than two
hundred thousand dollars. It’s one thing to use a pie chart to display
shares of an unordered categorical variable, such as percent
of total sales due to pizza, lasagna, and risotto. Keeping track of
ordered categories in a pie chart is harder again, especially when
we want to make a comparison between two distributions. The
wedges of these two pie charts are ordered (clockwise, from the
top), but it’s not so easy to follow them. This is partly because of
the pie-ness of the chart and partly because the color palette chosen
for the categories is not sequential. Instead it is unordered. The
colors allow the debt categories to be distinguished but don’t pick
out the sequence from low to high values.

So not only is a less than ideal plot type being used here, 
it’s being made to do a lot more work than usual, and with the 
wrong sort of color palette. As is often the case with pie charts, 
the compromise made to facilitate interpretation is to display all 
the numerical values for every wedge, and also to add a summary 
outside the pie. If you find yourself having to do this, it’s worth 
asking whether the chart could be redrawn, or whether you might 
as well just show a table instead. 

Here are two ways we might redraw these pie charts. As usual, 
neither approach is perfect. Or rather, each approach draws attention
to features of the data in slightly different ways. Which works
best depends on what parts of the data we want to highlight. The 
data are in an object called studebt. 

head(studebt)

## # A tibble: 6 x 4
##   Debt     type        pct Debtrc
##   <ord>    <fct>     <int> <ord>
## 1 Under $5 Borrowers    20 Under $5
## 2 $5-$10   Borrowers    17 $5-$10
## 3 $10-$25  Borrowers    28 $10-$25
## 4 $25-$50  Borrowers    19 $25-$50
## 5 $50-$75  Borrowers     8 $50-$75
## 6 $75-$100 Borrowers     3 $75-$100

Figure 8.25: Faceting the pie charts.


Our first effort to redraw the pie charts (fig. 8.25) uses a faceted 
comparison of the two distributions. We set up some labels in 
advance, as we will reuse them. We also make a special label for 
the facets. 


p_xlab <- "Amount Owed, in thousands of Dollars"
p_title <- "Outstanding Student Loans"
p_subtitle <- "44 million borrowers owe a total of $1.3 trillion"
p_caption <- "Source: FRB NY"

f_labs <- c(`Borrowers` = "Percent of\nall Borrowers",
            `Balances` = "Percent of\nall Balances")

p <- ggplot(data = studebt,
            mapping = aes(x = Debt, y = pct/100, fill = type))
p + geom_bar(stat = "identity") +
    scale_fill_brewer(type = "qual", palette = "Dark2") +
    scale_y_continuous(labels = scales::percent) +
    guides(fill = "none") +  # turn off the redundant fill legend
    theme(strip.text.x = element_text(face = "bold")) +
    labs(y = NULL, x = p_xlab,
         caption = p_caption,
         title = p_title,
         subtitle = p_subtitle) +
    facet_grid(~ type, labeller = as_labeller(f_labs)) +
    coord_flip()


There is a reasonable amount of customization in this graph.
First, the text of the facets is made bold in the theme() call. The
graphical element is first named (strip.text.x) and then modified
using the element_text() function. We also use a custom
palette for the fill mapping, via scale_fill_brewer(). And finally
we relabel the facets to something more informative than their
bare variable names. This is done using the labeller argument
and the as_labeller() function inside the facet_grid() call.
At the beginning of the plotting code, we set up an object called
f_labs, which is in effect a tiny lookup table (a named character
vector) that associates new labels with the values of the type
variable in studebt. We use backticks (the angled quote character
located next to the 1 key on U.S. keyboards) to pick out the values
we want to relabel. The as_labeller() function takes this object
and uses it to create new text for the labels when facet_grid() is called.
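
In R terms, f_labs is a named character vector, so the mapping from values to labels can be inspected before it is handed to as_labeller(). A brief sketch follows, using a made-up two-row data frame (toy) in place of studebt.

```r
library(ggplot2)

# Names are the values found in the data; elements are the new labels.
f_labs <- c(`Borrowers` = "Percent of\nall Borrowers",
            `Balances` = "Percent of\nall Balances")

# Look up a single label by the value it replaces.
f_labs[["Balances"]]

# Hypothetical data with the same two facet values.
toy <- data.frame(type = c("Borrowers", "Balances"),
                  share = c(0.2, 0.1))

p <- ggplot(data = toy, mapping = aes(x = type, y = share)) +
    geom_col() +
    facet_grid(~ type, labeller = as_labeller(f_labs))
p
```

Any facet value without a matching name in the vector would not get a new label, so it is worth checking that the names match the data exactly.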

Substantively, how is this plot better than the pie charts? We 
split the data into the two categories and showed the percentage 
shares as bars. The percent scores are on the x-axis. Instead of 
using color to distinguish the debt categories, we put their values 
on the y-axis instead. This means we can compare within a category
just by looking down the bars. For instance, the left-hand
panel shows that almost a fifth of the 44 million people with student
debt owe less than five thousand dollars. Comparisons across
categories are now easier as well, as we can scan across a row to see, 
for instance, that while just 1 percent or so of borrowers owe more 
than $200,000, that category accounts for more than 10 percent of 
all student debt. 

We could also have made this bar chart by putting the percentages
on the y-axis and the categories of amount owed on the
x-axis. When the categorical axis labels are long, though, I generally
find it’s easier to read them on the y-axis. Finally, while it
looks nice and helps a little to have the two categories of debt distinguished
by color, the colors on the graph are not encoding or
mapping any information in the data that is not already taken care 
of by the faceting. The fill mapping is useful but also redundant. 
This graph could easily be in black and white and would be just as
informative.

Figure 8.26: Debt distributions as horizontally segmented bars.

One thing that is not emphasized in a faceted chart like this 
is the idea that each of the debt categories is a share or percentage 
of a total amount. That is what a pie chart emphasizes more than 
anything, but as we saw there’s a perceptual price to pay for that, 
especially when the categories are ordered. But maybe we can hang 
on to the emphasis on shares by using a different kind of barplot. 
Instead of having separate bars distinguished by heights, we can 
array the percentages for each distribution proportionally within 
a single bar. We will make a stacked bar chart with just two main
bars (fig. 8.26) and lay them on their side for comparison.

library(viridis)

p <- ggplot(studebt, aes(y = pct/100, x = type, fill = Debtrc))
p + geom_bar(stat = "identity", color = "gray80") +
    scale_x_discrete(labels = as_labeller(f_labs)) +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_viridis(discrete = TRUE) +
    guides(fill = guide_legend(reverse = TRUE,
                               title.position = "top",
                               label.position = "bottom",
                               keywidth = 3,
                               nrow = 1)) +
    labs(x = NULL, y = NULL,
         fill = "Amount Owed, in thousands of dollars",
         caption = p_caption,
         title = p_title,
         subtitle = p_subtitle) +
    theme(legend.position = "top",
          axis.text.y = element_text(face = "bold", hjust = 1, size = 12),
          axis.ticks.length = unit(0, "cm"),
          panel.grid.major.y = element_blank()) +
    coord_flip()

Once again, there is a substantial amount of customization 
in this chart. I encourage you to peel it back one option at a 
time to see how it changes. We use the as_labeller() with 
f_labs again, but in the labels for the x-axis this time. We 
make a series of adjustments in the theme() call to customize
the purely visual elements of the plot, making the y-axis labels 
larger, right justified, and bold via element_text(), removing 
the axis tick-marks, and also removing the y-axis grid lines via 
element_blank(). 

More substantively, we take a lot of care about color in 
figure 8.26. First, we set the border colors of the bars to a light 
gray in geom_bar() to make the bar segments easier to distinguish. 
Second, we draw on the viridis library again (as we did with our 
small-multiple maps in chapter 7), using scale_fill_viridis() 
for the color palette. Third, we are careful to map the debt categories
in an ascending sequence of colors, and to adjust the key so
that the values run from low to high, from left to right, and from
yellow to purple. This is done partly by switching the fill mapping
from Debt to Debtrc. The categories of the latter are the same as
the former, but the sequence of debt levels is coded in the order
we want. We also show the legend to the reader first by putting it
at the top, under the title and subtitle. 
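
One way a recoded copy of an ordered factor could be made is by reversing the factor's levels while leaving the values themselves untouched. Whether studebt's Debtrc column was built exactly this way is an assumption; the sketch uses a short made-up vector and hypothetical names.

```r
# A hypothetical ordered factor of debt bands, low to high.
bands <- c("Under $5", "$5-$10", "$10-$25")
debt <- factor(c("$5-$10", "Under $5", "$10-$25"),
               levels = bands, ordered = TRUE)

# Same values, but with the level order reversed.
debt_rc <- factor(debt, levels = rev(levels(debt)), ordered = TRUE)

levels(debt)
levels(debt_rc)

# The underlying labels are unchanged; only their ordering differs.
all(as.character(debt) == as.character(debt_rc))
```

Because scales and legends follow the level order of a factor, a change like this alters the sequence in which categories are stacked and colored without touching the data values.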

The rest of the work is done in the guides() call. We have not
used guides() much thus far except to turn off legends that we
did not want to display. But here we see its usefulness. We give
guides() a series of instructions about the fill mapping: reverse
the direction of the color coding (reverse = TRUE); put the legend
title above the key (title.position); put the labels for the colors
below the key (label.position); widen the width of the color boxes
a little (keywidth); and place the whole key on a single row (nrow).

This is a lot of work, but if you don’t do it the plot will be
much harder to read. Again, I encourage you to peel away the
layers and options in sequence to see how the plot changes. The
version in figure 8.26 lets us more easily see how the categories of
dollar amounts owed break down as a percentage of all balances,
and as a percent of all borrowers. We can also eyeball comparisons
between the two types, especially at the far end of each scale.
It’s easy to see how a tiny percentage of borrowers account for
a disproportionately large share of total debt, for example. But
even with all this careful work, estimating the size of each individual
segment is still not as easy here as it is in figure 8.25, the
faceted version. This is because it’s harder to estimate sizes when
we don’t have an anchor point or baseline scale to compare each
piece to. (In the faceted plot, that comparison point was the x-axis.)
So the size of the “Under 5” segment in the bottom bar
is much easier to estimate than the size of the “$10-25” bar, for 
instance. Our injunction to take care about using stacked bar 
charts still has a lot of force, even when we try hard to make the best 
of them. 


8.6 Where to Go Next 

We have reached the end of our introduction. From here on, you 
should be in a strong position to forge ahead in two main ways. 
The first is to become more confident and practiced with your coding.
Learning ggplot should encourage you to learn more about
the set of tidyverse tools, and then by extension to learn more
about R in general. What you choose to pursue will (and should)
be driven by your own needs and interests as a scholar or data scientist.
The most natural text to look at next is Garrett Grolemund
and Hadley Wickham’s R for Data Science (2016), which introduces
tidyverse components that we have drawn on here but not pursued
in depth. Other useful texts include Chang (2013) and Roger
Peng’s R Programming for Data Science (2016). The most thorough 
introduction to ggplot in particular can be found in Wickham 
(2016). 

Pushing ahead to use ggplot for new kinds of graphs will eventually
get you to the point where ggplot does not quite do what
you need or does not quite provide the sort of geom you want. In
that case, the first place to look is the world of extensions to the
ggplot framework. Daniel Emaasit’s overview of add-on packages
for ggplot is the best place to begin your search. We have used a few
extensions in the book already. Like ggrepel and ggridges, extensions
typically provide a new geom or two to work with, which
may be just what you need. Sometimes, as with Thomas Lin Pedersen’s
ggraph, you get a whole family of geoms and associated
tools (in ggraph’s case, a suite of tidy methods for the visualization
of network data). Other modeling and analysis tasks may
require more custom work, or coding that is closely connected
to the kind of analysis being done. Harrell (2016) provides many
clearly worked examples, mostly based on ggplot; Gelman & Hill
(2018) and Imai (2017) also introduce contemporary methods
using R; Silge & Robinson (2017) present a tidy approach to analyzing
and visualizing textual data; while Friendly & Meyer (2017)
thoroughly explore the analysis of discrete data, an area that is
often challenging to approach visually.

The second way you should push ahead is by looking at and 
thinking about other people’s graphs. The R Graph Gallery, run 
by Yan Holtz, is a useful collection of examples of many kinds of 
graphics drawn with ggplot and other R tools. PolicyViz, a site run 
by Jon Schwabish, covers a range of topics on data visualization. 
It regularly features case studies where visualizations are reworked 
to improve them or cast new light on the data they present. But do 
not just look for examples that have code with them to begin with. 
As I have said before, a real strength of ggplot is the grammar of
graphics that underpins it. That grammar is a model you can use to
look at and interpret any graph, no matter how it was produced. It 
gives you a vocabulary that lets you say what the data, mappings, 
geoms, scales, guides, and layers of any particular graph might be. 
And because the grammar is implemented as the ggplot library, it is 
a short step from being able to anatomize the structure of a graph 
to being able to sketch an outline of the code you could write to 
reproduce it yourself. 
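To make that step concrete, here is a minimal sketch of how an anatomized graph might be written down as an outline of ggplot code. It assumes the mpg dataset that ships with ggplot2 as a stand-in for whatever data the graph you are looking at was drawn from; the particular mappings, breaks, and labels are illustrative choices, not a reconstruction of any specific figure.

```r
library(ggplot2)

# Data: mpg, a small dataset bundled with ggplot2 (a stand-in example).
# Mappings: displacement on x, highway mileage on y, vehicle class as color.
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy, color = class))

# Geoms, scales, guides, and labels are then added one layer at a time.
p_out <- p +
    geom_point() +
    scale_x_continuous(breaks = seq(2, 7, 1)) +
    guides(color = guide_legend(title = "Vehicle class")) +
    labs(x = "Engine displacement (litres)",
         y = "Highway miles per gallon")

p_out
```

Each line answers one of the questions the grammar lets you ask of a graph: what is the data, what is mapped to what, which geom draws it, and how are the scales and guides set.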

While its underlying principles and goals are relatively stable,
the techniques and tools of research are changing. This is especially
true within the social sciences (Salganik 2018). Data visualization
is an excellent entry point to these new developments. Our tools
for it are more versatile and powerful than ever. You should look
at your data. Looking is not a replacement for thinking. It cannot
force you to be honest; it cannot magically prevent you from
making mistakes; and it cannot make your ideas true. But if you
analyze data, visualization can help you uncover features in it. If
you are honest, it can help you live up to your own standards.
When you inevitably make errors, it can help you find and correct
them. And if you have an idea and some good evidence for it,
it can help you show it in a compelling way.

Sites mentioned in this section: ggplot2-exts.org, r-graph-gallery.com, policyviz.com
This appendix contains supplemental information about various
aspects of R and ggplot that you are likely to run into as you use
them. You are at the beginning of a process of discovering practical
problems that are an inevitable part of using software. This is often
frustrating. But feeling stumped is a standard experience for
everyone who writes code. Each time you figure out the solution to your
problem, you acquire more knowledge about how and why things
go wrong, and more confidence about how to tackle the next glitch
that comes along.


1 A little more about R 
How to read an R help page 

Functions, datasets, and other built-in objects in R are documented
in its help system. You can search or browse this documentation
via the “Help” tab in RStudio’s lower right-hand window.
The quality of R’s help pages varies somewhat. They tend to be on
the terse side. However, they all have essentially the same structure,
and it is useful to know how to read them. Figure A.1 provides an
overview of what to look for. Remember, functions take inputs,
perform actions, and return outputs. Something goes in, it gets
worked on, and then something comes out. That means you want
to know what the function requires, what it does, and what it
returns. What it requires is shown in the Usage and Arguments
sections of the help page. The names of the required and optional
arguments are given in the order the function expects them. Some
arguments have default values. In the case of the mean() function,
the argument na.rm is set to FALSE by default. Defaults are shown
in the Usage section. If a named argument has no default, you
will have to give it a value. Depending on what the argument is,
this might be a logical value, a number, a dataset, or any other
object.
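As a small illustration of reading the Usage section, the sketch below shows the na.rm default of mean() at work. The vector x is made up for the example; formals() is a base R function that reports a function’s arguments and their defaults.

```r
# A made-up vector with one missing value.
x <- c(4, 8, NA, 15)

# The Usage section lists na.rm = FALSE as the default, so the
# missing value propagates and the result is NA.
mean(x)

# Naming the argument overrides its default.
mean(x, na.rm = TRUE)

# formals() reports the defaults that the Usage section displays.
formals(mean.default)$na.rm
```

The argument x has no default, which is why mean() cannot be called without it.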
mean {base}                                          R Documentation

Arithmetic Mean

Description

    Generic function for the (trimmed) arithmetic mean.

Usage

    mean(x, ...)

    ## Default S3 method:
    mean(x, trim = 0, na.rm = FALSE, ...)

Arguments

    x       An R object. Currently there are methods for
            numeric/logical vectors and date, date-time and time
            interval objects. Complex vectors are allowed for
            trim = 0, only.
    trim    the fraction (0 to 0.5) of observations to be trimmed
            from each end of x before the mean is computed. Values
            of trim outside that range are taken as the nearest
            endpoint.
    na.rm   a logical value indicating whether NA values should be
            stripped before the computation proceeds.
    ...     further arguments passed to or from other methods.

Value

    If trim is zero (the default), the arithmetic mean of the values
    in x is computed, as a numeric or complex vector of length one.
    If x is not logical (coerced to numeric), numeric (including
    integer) or complex, NA_real_ is returned, with a warning.

    If trim is non-zero, a symmetrically trimmed mean is computed
    with a fraction of trim observations deleted from each end
    before the mean is computed.

References

    Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S
    Language. Wadsworth & Brooks/Cole.

See Also

    weighted.mean, mean.POSIXct, colMeans for row and column means.

Examples

    x <- c(0:10, 50)
    xm <- mean(x)
    c(xm, mean(x, trim = 0.10))

[Package base version 3.4.3 Index]

Annotations on the help page:

Title line: The name of the function, and the library it is in.

Description: What it does.

Usage: The function’s name, and in the parentheses the named
arguments it expects, in the order it expects them. If an argument
has a default value, it is shown. Arguments without default values
(e.g. x) must be provided by you.

Arguments: More details on each named argument. This will tell you
what class of thing each argument has to be: an object, a number, a
data frame, a logical value, etc. The ellipsis allows other
arguments to be passed to and from the function.

Value: What the function returns, i.e., the result of whatever
operation or calculation it performs. This can be a single number,
as here, or a multi-part object such as a list, a data frame, a
plot, or a model.

See Also: Other related functions.

Examples: Self-contained examples that you can run at the console.
These may use built-in datasets or other R functions.

Index link: Visit the package’s Index page to look for Demos and
Vignettes detailing how it works.

Figure A.1: The structure of an R help page.


The other part to look at closely is the Value section, which
tells you what the function returns once it has done its calculation.
Again, depending on what the function is, this might simply be
a single number or other short bit of output. But it could also be
something as complex as a ggplot figure or a model object
consisting of many separate parts organized as a list.

Well-documented packages will often have demos and vignettes
attached to them. These are meant to describe the package as a
whole, rather than specific functions. A good package vignette will
often have one or more fully worked examples together with a
discussion describing how the package works and what it can do. To
see if there are any package vignettes, click the link at the bottom of
the function’s help page to be taken to the package index. Any
available demos, vignettes, or other general help will be listed at the top.
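The same information can usually be reached from the console. The sketch below assumes nothing beyond a standard R installation; the grid and graphics packages are used only because they ship with R, and any installed package name could be substituted.

```r
# List the vignettes that come with an installed package.
v <- vignette(package = "grid")
v$results[, c("Item", "Title")]

# List a package's demos in the same way.
d <- demo(package = "graphics")
d$results[, c("Item", "Title")]

# Open the package index page in the help viewer (interactive use):
# help(package = "grid")
```

If a package has no vignettes or demos, the corresponding listing is simply empty.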

The basics of accessing and selecting things 

Generally speaking, the tidyverse’s preferred methods for data
subsetting, filtering, slicing and selecting will keep you away from
the underlying mechanics of selecting and extracting elements of
vectors, matrices, or tables of data. Carrying out these operations
through functions like select(), filter(), subset(), and
merge() is generally safer and more reliable than accessing
elements directly. However, it is worth knowing the basics of these
operations. Sometimes accessing elements directly is the most
convenient thing to do. More important, we may use these
techniques in small ways in our code with some regularity. Here we
very briefly introduce some of R’s selection operators for vectors,
arrays, and tables.

Consider the my_numbers and your_numbers vectors again.

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)

To access any particular element in my_numbers, we use square
brackets. Square brackets are not like the parentheses after
functions. They are used to pick out an element indexed by its position:

my_numbers[4]

## [1] 1

my_numbers[7]

## [1] 25

Putting the number n inside the brackets will give us (or
“return”) the nth element in the vector, assuming there is one. To
access a sequence of elements within a vector we can do this:

my_numbers[2:4]

## [1] 2 3 1

This shorthand notation tells R to count from the second to
the fourth element, inclusive. We are not restricted to selecting
contiguous elements either. We can make use of our c() function
again:

my_numbers[c(2, 4)]

## [1] 2 1

R evaluates the expression c(2, 4) first and then extracts
just the second and the fourth element from my_numbers, ignoring
the others. You might wonder why we didn’t just write
my_numbers[2, 4] directly. The answer is that this notation is used
for objects arrayed in two dimensions (i.e., something with rows
and columns), like matrices, data frames, or tibbles. We can make
a two-dimensional object by creating two different vectors with
the c() function and using the tibble() function to collect them
together:

my_tb <- tibble(mine = c(1, 4, 5, 8:11),
                yours = c(3, 20, 16, 34:31))

class(my_tb)

## [1] "tbl_df"     "tbl"        "data.frame"

my_tb

## # A tibble: 7 x 2
##    mine yours
##   <dbl> <dbl>
## 1    1.    3.
## 2    4.   20.
## 3    5.   16.
## 4    8.   34.
## 5    9.   33.
## 6   10.   32.
## 7   11.   31.


We index data frames, tibbles, and other arrays by row
first, and then by column. Arrays may also have more than two
dimensions.

my_tb[3, 1] # Row 3, Col 1

## # A tibble: 1 x 1
##    mine
##   <dbl>
## 1    5.

my_tb[1, 2] # Row 1, Col 2

## # A tibble: 1 x 1
##   yours
##   <dbl>
## 1    3.

In these chunks of code you will see some explanatory text set off
by the hash symbol, #. In R’s syntax, the hash symbol is used to
designate a comment. On any line of code, text that appears after a
# symbol will be ignored by R’s interpreter. It won’t be evaluated
and it won’t trigger a syntax error.

The columns in our tibble have names. We can select elements
using them, too. We do this by putting the name of the
column in quotes where we previously put the index number of the
column:

my_tb[3, "mine"] # Row 3, Col 1

## # A tibble: 1 x 1
##    mine
##   <dbl>
## 1    5.

my_tb[1, "yours"] # Row 1, Col 2

## # A tibble: 1 x 1
##   yours
##   <dbl>
## 1    3.



If we want to get all the elements of a particular column, we 
can leave out the row index. This will mean all the rows will be 
included for whichever column we select. 

my_tb[, "mine"] # All rows, Col 1

## # A tibble: 7 x 1
##    mine
##   <dbl>
## 1    1.
## 2    4.
## 3    5.
## 4    8.
## 5    9.
## 6   10.
## 7   11.

We can do this the other way around, too, selecting a particular 
row and showing all columns: 

my_tb[4, ] # Row 4, all cols

## # A tibble: 1 x 2
##    mine yours
##   <dbl> <dbl>
## 1    8.   34.


A better way of accessing particular columns in a data frame is
via the $ operator, which can be used to extract components of
various sorts of object. This way we append the name of the column
we want to the name of the object it belongs to:

my_tb$mine

## [1]  1  4  5  8  9 10 11

Elements of many other objects can be extracted in this way, 
too, including nested objects. 

out <- lm(mine ~ yours, data = my_tb)

out$coefficients

## (Intercept)       yours
##  -0.0801192   0.2873422

out$call

## lm(formula = mine ~ yours, data = my_tb)

out$qr$rank # nested

## [1] 2

Finally, in the case of data frames, the $ operator also lets us 
add new columns to the object. For example, we can add the first 
two columns together, row by row. To create a column in this way, 
we put the $ and the name of the new column on the left side of 
the assignment. 


my_tb$ours <- my_tb$mine + my_tb$yours
my_tb

## # A tibble: 7 x 3
##    mine yours  ours
##   <dbl> <dbl> <dbl>
## 1    1.    3.    4.
## 2    4.   20.   24.
## 3    5.   16.   21.
## 4    8.   34.   42.
## 5    9.   33.   42.
## 6   10.   32.   42.
## 7   11.   31.   42.
Table A.1
Some untidy data.

name            treatmenta   treatmentb
John Smith      NA           18
Jane Doe        4            1
Mary Johnson    6            7


Table A.2
The same data, still untidy, but in a different way.

treatment   John Smith   Jane Doe   Mary Johnson
a           NA           4          6
b           18           1          7


Table A.3
Tidied data. Every variable a column, every observation a row.

name            treatment   n
Jane Doe        a           4
Jane Doe        b           1
John Smith      a           NA
John Smith      b           18
Mary Johnson    a           6
Mary Johnson    b           7

In this book we do not generally access data via [ or $. It is
particularly bad practice to access elements by their index
number only, as opposed to using names. In both cases, and especially
the latter, it is too easy to make a mistake and choose the wrong
columns or rows. In addition, if our table changes shape later on
(e.g., due to the addition of new original data), then any absolute
reference to the position of columns (rather than to their names) is
very likely to break. Still, we do use the c() function for small tasks
quite regularly, so it’s worth understanding how it can be used to
pick out elements from vectors.
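For comparison, here is a sketch of the name-based alternatives using dplyr. It rebuilds the small my_tb table so that it can be run on its own; the object name my_tb_named is introduced just for this example.

```r
library(tibble)
library(dplyr)

my_tb <- tibble(mine = c(1, 4, 5, 8:11),
                yours = c(3, 20, 16, 34:31))

# Pick a column by its name rather than by its position.
my_tb %>% select(mine)

# Keep rows that satisfy a condition instead of counting row numbers.
my_tb %>% filter(yours > 30)

# Add a new column by name, the counterpart of my_tb$ours <- ...
my_tb_named <- my_tb %>% mutate(ours = mine + yours)
my_tb_named
```

Because every step refers to columns by name, the code keeps working if columns are later added or reordered.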


Tidy data 

Working with R and ggplot is much easier if the data you use is 
in the right shape. Ggplot wants your data to be tidy. For a more 
thorough introduction to the idea of tidy data, see chapters 5 and 
12 of Wickham & Grolemund (2016). To get a sense of what a tidy 
dataset looks like in R, we will follow the discussion in Wickham 
(2014). In a tidy dataset, 

1. Each variable is a column. 

2. Each observation is a row. 

3. Each type of observational unit forms a table. 

For most of your data analysis, the first two points are the most
important. The third point might be a little unfamiliar. It is a
feature of “normalized” data from the world of databases, where the
goal is to represent data in a series of related tables with minimal
duplication (Codd 1990). Data analysis more usually works with
a single large table of data, often with considerable duplication of
some variables down the rows.

Data presented in summary tables is often not “tidy” as defined 
here. When structuring our data, we need to be clear about how 
our data is arranged. If your data is not tidily arranged, the chances 
are good that you will have more difficulty, and maybe a lot more 
difficulty, getting ggplot to draw the figure you want. 

For example, consider table A.1 and table A.2, from Wickham’s
discussion. They present the same data in different ways, but each 
would cause trouble if we tried to work with it in ggplot to make 
a graph. Table A.3 shows the same data once again, this time in a 
tidied form. 
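As a rough sketch of how one might get from table A.1 to table A.3 in code, the example below enters table A.1 by hand and reshapes it with tidyr’s gather() function. The object names untidy and tidy are ours, and stripping the “treatment” prefix with sub() is one choice among several.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Table A.1 entered by hand: one column per treatment.
untidy <- tribble(
    ~name,          ~treatmenta, ~treatmentb,
    "John Smith",   NA,          18,
    "Jane Doe",     4,           1,
    "Mary Johnson", 6,           7
)

# Gather the two treatment columns into key-value pairs, then strip
# the "treatment" prefix so that the key holds just "a" or "b".
tidy <- untidy %>%
    gather(key = treatment, value = n, treatmenta:treatmentb) %>%
    mutate(treatment = sub("^treatment", "", treatment)) %>%
    arrange(name, treatment)

tidy
```

The result has one row per person and treatment, with the measured value in its own column, as in table A.3.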
Table A-1. Years of school completed by people 25 years and over, by age and sex: selected years 1940 to 2016
(Numbers in thousands. Noninstitutionalized population except where otherwise specified.)

                       Years of school completed
Age, sex,              Elementary          High school         College
and years      Total    0 to 4    5 to 8    1 to 3   4 years    1 to 3   4 years    Median
                         years     years     years               years   or more

25 years and older
Male
2016         103,372     1,183     3,513     7,144    30,780    26,468    34,283      (NA)
2015         101,887     1,243     3,669     7,278    30,997    25,778    32,923      (NA)
2014         100,592     1,184     3,761     7,403    30,718    25,430    32,095      (NA)
2013          99,305     1,127     3,836     7,314    30,014    25,283    31,731      (NA)
2012          98,119     1,237     3,879     7,388    30,216    24,632    30,766      (NA)
2011          97,220     1,234     3,883     7,443    30,370    24,319    29,971      (NA)
2010          96,325     1,279     3,931     7,705    30,682    23,570    29,158      (NA)
2009          95,518     1,372     4,027     7,754    30,025    23,634    28,706      (NA)
2008          94,470     1,310     4,136     7,853    29,491    23,247    28,433      (NA)
2007          93,421     1,458     4,249     8,294    29,604    22,219    27,596      (NA)
2006          92,233     1,472     4,395     7,940    29,380    22,136    26,910      (NA)
2005          90,899     1,505     4,402     7,787    29,151    21,794    26,259      (NA)

Figure A.2: Untidy data from the census.


Hadley Wickham notes five main ways tables of data tend not 
to be tidy: 

1. Column headers are values, not variable names. 

2. Multiple variables are stored in one column. 

3. Variables are stored in both rows and columns. 

4. Multiple types of observational units are stored in the same 
table. 

5. A single observational unit is stored in multiple tables. 

Data comes in an untidy form all the time, often for the good
reason that it can be presented that way using much less space,
or with far less repetition of labels and row elements. Figure A.2
shows the first few rows of a table of U.S. Census Bureau data
about educational attainment in the United States. To begin with,
it’s organized as a series of subtables down the spreadsheet,
broken out by age and sex. Second, the underlying variable of interest,
“Years of School Completed,” is stored across several columns,
with an additional variable (level of schooling) included across the
columns also. It is not hard to get the table into a slightly more
regular format by eliminating the blank rows and explicitly naming
the subtable rows. One can do this manually and get to the point
where it can be read in as an Excel or CSV file. This is not ideal,
as manually cleaning data runs against the commitment to do as
much as possible programmatically. We can automate the process
somewhat. The tidyverse comes with a readxl package that tries
to ease the pain.

readxl.tidyverse.org


## # A tibble: 366 x 11
##    age   sex    year total elem4 elem8   hs3   hs4 coll3 coll4 median
##    <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 25-34 Male   2016 21845   116   468 1427. 6386. 6015. 7432.     NA
##  2 25-34 Male   2015 21427   166   488 1584. 6198. 5920. 7071.     NA
##  3 25-34 Male   2014 21217   151   512 1611. 6323. 5910. 6710.     NA
##  4 25-34 Male   2013 20816   161   582 1747. 6058. 5749. 6519.     NA
##  5 25-34 Male   2012 20464   161   579 1707. 6127. 5619. 6270.     NA
##  6 25-34 Male   2011 20985   190   657 1791. 6444. 5750. 6151.     NA
##  7 25-34 Male   2010 20689   186   641 1866. 6458. 5587. 5951.     NA
##  8 25-34 Male   2009 20440   184   695 1806. 6495. 5508. 5752.     NA
##  9 25-34 Male   2008 20210   172   714 1874. 6356. 5277. 5816.     NA
## 10 25-34 Male   2007 20024   246   757 1930. 6361. 5137. 5593.     NA
## # ... with 356 more rows

The tidyverse has several tools to help you get the rest of the
way in converting your data from an untidy to a tidy state. These
can mostly be found in the tidyr and dplyr packages. The former
provides functions for converting, for example, wide-format
data to long-format data, as well as assisting with the business of
splitting and combining variables that are untidily stored. The
latter has tools that allow tidy tables to be further filtered, sliced, and
analyzed at different grouping levels, as we have seen throughout
this book.

With our edu object, we can use the gather() function to
transform the schooling variables into a key-value arrangement.
The key is the underlying variable, and the value is the value
it takes for that observation. We create a new object, edu_t, in
this way.

edu_t <- gather(data = edu,
                key = school,
                value = freq,
                elem4:coll4)


head(edu_t)

## # A tibble: 6 x 7
##   age   sex    year total median school  freq
##   <chr> <chr> <int> <int>  <dbl> <chr>  <dbl>
## 1 25-34 Male   2016 21845     NA elem4   116.
## 2 25-34 Male   2015 21427     NA elem4   166.
## 3 25-34 Male   2014 21217     NA elem4   151.
## 4 25-34 Male   2013 20816     NA elem4   161.
## 5 25-34 Male   2012 20464     NA elem4   161.
## 6 25-34 Male   2011 20985     NA elem4   190.

tail(edu_t)

## # A tibble: 6 x 7
##   age   sex     year total median school  freq
##   <chr> <chr>  <int> <int>  <dbl> <chr>  <dbl>
## 1 55>   Female  1959 16263   8.30 coll4   688.
## 2 55>   Female  1957 15581   8.20 coll4   630.
## 3 55>   Female  1952 13662   7.90 coll4   628.
## 4 55>   Female  1950 13150   8.40 coll4   436.
## 5 55>   Female  1947 11810   7.60 coll4   343.
## 6 55>   Female  1940  9777   8.30 coll4   219.


The educational categories previously spread over the columns
have been gathered into two new columns. The school variable
is the key column. It contains all the education categories that
were previously given across the column headers, from zero to
four years of elementary school to four or more years of college.
They are now stacked up on top of each other in the rows. The
freq variable is the value column and contains the unique value of
schooling for each level of that variable. Once our data is in this
long-form shape, it is ready for easy use with ggplot and related
tidyverse tools.
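More recent versions of tidyr also offer pivot_longer() for the same kind of reshaping. The sketch below is not the code used in this book; it assumes a two-row stand-in table, edu_small, built by hand with the same column layout as edu and values copied from the first two rows of the printed output.

```r
library(tibble)
library(tidyr)

# A hand-built stand-in with the same columns as edu.
edu_small <- tibble(
    age = c("25-34", "25-34"), sex = c("Male", "Male"),
    year = c(2016L, 2015L), total = c(21845L, 21427L),
    elem4 = c(116L, 166L), elem8 = c(468L, 488L),
    hs3 = c(1427, 1584), hs4 = c(6386, 6198),
    coll3 = c(6015, 5920), coll4 = c(7432, 7071),
    median = c(NA_real_, NA_real_)
)

# names_to plays the role of gather()'s key, values_to that of value.
edu_small_t <- pivot_longer(edu_small,
                            cols = elem4:coll4,
                            names_to = "school",
                            values_to = "freq")
edu_small_t
```

The columns come out the same as with gather(), though the rows are ordered by original row first rather than by schooling category.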


2 Common Problems Reading in Data 
Date formats 

Date formats can be annoying. First, times and dates must be
treated differently from ordinary numbers. Second, there are many
different date formats, differing both in the precision with which
they are stored and in the convention they follow about how to
display years, months, days, and so on. Consider the following
data:

head(bad_date)

## # A tibble: 6 x 2
##   date       N
##   <chr>  <int>
## 1 9/1/11 44426
## 2 9/2/11 55112
## 3 9/3/11 19263
## 4 9/4/11 12330
## 5 9/5/11  8534
## 6 9/6/11 59490

The data in the date column has been read in as a character
string, but we want R to treat it as a date. If R can’t treat it as a
date, we get bad results (fig. A.3).

p <- ggplot(data = bad_date, aes(x = date, y = N))
p + geom_line()

## geom_path: Each group consists of only one observation.
## Do you need to adjust the group aesthetic?

Figure A.3: A bad date.

What has happened? The problem is that ggplot doesn’t know
date consists of dates. As a result, when we ask to plot it on the
x-axis, it tries to treat the unique elements of date like a
categorical variable instead. (That is, as a factor.) But because each date
is unique, its default effort at grouping the data results in every
group having only one observation in it (i.e., that particular row).
The ggplot function knows something is odd about this and tries
to let you know. It wonders whether we’ve failed to set group =
<something> in our mapping.

For the sake of it, let’s see what happens when the bad date
values are not unique. We will make a new data frame by stacking
two copies of the data on top of each other. The rbind() function
does this for us. We end up in figure A.4 with two copies of every
observation.

bad_date2 <- rbind(bad_date, bad_date)

p <- ggplot(data = bad_date2, aes(x = date, y = N))
p + geom_line()

Figure A.4: Still bad.

Now ggplot doesn’t complain at all, because there’s more
than one observation per (inferred) group. But the plot is still
wrong!

We will fix this problem using the lubridate package. It
provides a suite of convenience functions for converting date strings
in various formats and with various separators (such as / or -)
into objects of class Date that R knows about. Here our bad dates
are in a month/day/year format, so we use mdy(). Consult the
lubridate package’s documentation to learn more about similar
convenience functions for converting character strings where the
date components appear in a different order.


# install.packages('lubridate')
library(lubridate)

bad_date$date <- mdy(bad_date$date)
head(bad_date)

## # A tibble: 6 x 2
##   date           N
##   <date>     <int>
## 1 2011-09-01 44426
## 2 2011-09-02 55112
## 3 2011-09-03 19263
## 4 2011-09-04 12330
## 5 2011-09-05  8534
## 6 2011-09-06 59490

Now date has a Date class. Let’s try the plot again (fig. A.5). 


p <- ggplot(data = bad_date, aes(x = date, y = N)) 
p + geom_line() 



Figure A.5: Much better. 
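The other lubridate parsers follow the same naming pattern as mdy(), with the letters giving the order of the components. A brief sketch, using made-up strings that all describe the same day:

```r
library(lubridate)

# The function name spells out the order of the date components.
d1 <- mdy("9/1/11")        # month/day/year
d2 <- dmy("1-9-2011")      # day-month-year
d3 <- ymd("2011/09/01")    # year/month/day

c(d1, d2, d3)
class(d1)
```

Whatever the input format, the result is an object of class Date, which is what ggplot needs in order to draw a proper date axis.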
Year-only dates

Many variables are measured by the year and supplied in the data
as a four-digit number rather than as a date. This can sometimes
cause headaches when we want to plot year on the x-axis. It
happens most often when the time series is relatively short. Consider
this data:


url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"

bad_year <- read_csv(url)

bad_year %>% select(1:3) %>% sample_n(10)


## # A tibble: 10 x 3
##    country        year donors
##    <chr>         <int>  <dbl>
##  1 United States  1994  19.4
##  2 Australia      1999   8.67
##  3 Canada         2001  13.5
##  4 Australia      1994  10.2
##  5 Sweden         1993  15.2
##  6 Ireland        1992  19.5
##  7 Switzerland    1997  14.3
##  8 Ireland        2000  17.6
##  9 Switzerland    1998  15.4
## 10 Norway           NA  NA

Figure A.6: Integer year shown with a decimal point. (The x-axis, Year, is labeled 1992.5, 1995.0, 1997.5, 2000.0, 2002.5.)


This is a version of organdata but in a less clean format. The 
year variable is an integer (its class is <int>) and not a date. Let’s 
say we want to plot donation rate against year. 


p <- ggplot(data = bad_year, aes(x = year, y = donors)) 
p + geom_point() 


The decimal point on the x-axis labels in figure A.6 is unwanted.
We could sort this out cosmetically, by giving
scale_x_continuous() a set of breaks and labels that represent the years
as characters. Alternatively, we can change the class of the year
variable. For convenience, we will tell R that the year variable
should be treated as a date measure and not an integer. We’ll use
a home-cooked function, int_to_year(), that takes integers and
converts them to dates.

bad_year$year <- int_to_year(bad_year$year)
bad_year %>% select(1:3)

## # A tibble: 238 x 3
##    country   year       donors
##    <chr>     <date>      <dbl>
##  1 Australia NA          NA
##  2 Australia 1991-01-01  12.1
##  3 Australia 1992-01-01  12.4
##  4 Australia 1993-01-01  12.5
##  5 Australia 1994-01-01  10.2
##  6 Australia 1995-01-01  10.2
##  7 Australia 1996-01-01  10.6
##  8 Australia 1997-01-01  10.3
##  9 Australia 1998-01-01  10.5
## 10 Australia 1999-01-01   8.67
## # ... with 228 more rows

In the process, today’s day and month are introduced into the 
year data, but that is irrelevant in this case, given that our data is 
observed only in a yearly window to begin with. However, if you 
wish to specify a generic day and month for all the observations, 
the function allows you to do this. 
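Both routes can be sketched briefly. The code below is an illustration under stated assumptions, not the definition of int_to_year(): the data frame yr_df is made up, and year_to_date() is a hypothetical helper that fixes the month and day at January 1 unless told otherwise.

```r
library(ggplot2)

# Made-up data with an integer year column.
yr_df <- data.frame(year = 1991:2002,
                    donors = seq(10, 21, length.out = 12))

# Option 1 (cosmetic): keep year numeric, but choose the breaks and
# label them with character strings.
yr_breaks <- seq(1992, 2002, by = 2)
p_yr <- ggplot(data = yr_df, aes(x = year, y = donors)) +
    geom_point() +
    scale_x_continuous(breaks = yr_breaks,
                       labels = as.character(yr_breaks))

# Option 2: a hypothetical helper in the spirit of int_to_year().
# It pastes each year together with a fixed month and day, and
# leaves missing years as missing dates.
year_to_date <- function(x, month = "01", day = "01") {
    out <- rep(as.Date(NA), length(x))
    ok <- !is.na(x)
    out[ok] <- as.Date(paste(x[ok], month, day, sep = "-"))
    out
}

year_to_date(c(1991L, NA, 1999L))
```

The first option changes only how the axis is labeled. The second changes the class of the variable, so ggplot picks a date scale on its own.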


Write functions for repetitive tasks 

If you are working with a dataset that you will be making a lot of
similar plots from or will need to periodically look at in a way that
is repetitive but can’t be carried out in a single step once and for all,
then the chances are that you will start accumulating sequences
of code that you find yourself using repeatedly. When this
happens, the temptation will be to start copying and pasting these
sequences from one analysis to the next. We can see something
of this tendency in the code samples for this book. To make the
exposition clearer, we have periodically repeated chunks of code
that differ only in the dependent or independent variable being
plotted.

Try to avoid copying and pasting code repeatedly in this way.
Instead, this is an opportunity to write a function to help you out a
little. Almost everything in R is accomplished through functions,
and it’s not difficult to write your own. This is especially the case
when you begin by thinking of functions as a way to help you
automate some local or smaller task rather than a means of
accomplishing some very complex task. R has the resources to help you
build complex functions and function libraries, like ggplot itself.
But we can start quite small, with functions that help us manage a
particular dataset or data analysis.

Remember, functions take inputs, perform actions, and return 
outputs. For example, imagine a function that adds two numbers, 
x and y. In use, it might look like this: 

add_xy(x = 1, y = 7) 

## [1] 8

How do we create this function? Remember, everything is an
object, so functions are just special kinds of object. And everything
in R is done via functions. So if we want to make a new function,
we will use an existing function to do it. In R, functions are created
with function():

add_xy <- function(x, y) {
    x + y
}


You can see that function() is a little different from ordinary 
functions in two ways. First, the arguments we give it (here, x 
and y) are for the add_xy function that we are creating. Second, 
immediately after the function(x, y) statement there’s an opening 
brace, {, followed by a bit of R code that adds x and y, and then 
the closing brace }. That’s the content of the function. We assign 
this code to the add_xy object, and now we have a function that 
adds two numbers together and returns the result. The x + y line 
inside the braces is evaluated as if it were typed at the console, 
assuming you have told it what x and y are. 

add_xy(x = 5, y = 2) 

## [1] 7 
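
Because a function is an ordinary object, it can be inspected and called like any other. The short sketch below redefines add_xy() so that it runs on its own. It is only an illustration of the points made above: arguments can be matched by position or by name, and the body is evaluated with whatever values are supplied.

```r
# Define the function again so this example is self-contained
add_xy <- function(x, y) {
  x + y
}

# A function is an object like any other
class(add_xy)
is.function(add_xy)

# Arguments can be matched by position or by name
by_position <- add_xy(5, 2)
by_name <- add_xy(y = 2, x = 5)
identical(by_position, by_name)

# The body is ordinary R code, so it is vectorized like + itself
add_xy(x = c(1, 2, 3), y = 10)
```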

Functions can take many kinds of arguments, and we can also 
tell them what the default value of each argument should be by 
specifying it inside the function(...) section. Functions are little 
programs that have all the power of R at their disposal, including 
standard things like flow-control through if ... else statements 
and so on. Here, for instance, is a function that will make a scatterplot 
for any section in the ASA data, or optionally fit a smoother 
to the data and plot that instead. Defining a function looks a little 
like calling one, except that we spell out the steps inside. We also 
specify the default arguments. 


plot_section <- function(section="Culture", x = "Year", 
                         y = "Members", data = asasec, 
                         smooth=FALSE){ 
    require(ggplot2) 
    require(splines) 
    # Note use of aes_string() rather than aes() 
    p <- ggplot(subset(data, Sname==section), 
                mapping = aes_string(x=x, y=y)) 

    if(smooth == TRUE) { 
        p0 <- p + geom_smooth(color = "#999999", 
                              size = 1.2, method = "lm", 
                              formula = y ~ ns(x, 3)) + 
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) + 
            labs(title = section) 
    } else { 
        p0 <- p + geom_line(color= "#E69F00", size=1.2) + 
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) + 
            labs(title = section) 
    } 

    print(p0) 
} 



This function is not very general. Nor is it particularly robust. 
But for the use we want to put it to (fig. A.7) it works just fine. 

Figure A.7: Using a function to plot your results. [Panels: Rationality; Sexualities.] 

plot_section("Rationality") 
plot_section("Sexualities", smooth = TRUE) 

If we were going to work with this data for long enough, we 
could make the function progressively more general. For example, 
we can add the special ... argument (which means, roughly, 
“and any other named arguments”) to allow us to pass arguments 
through to the geom_smooth() function (fig. A.8) in the way we’d 
expect if we were using it directly. With that in place, we can pick 
the smoothing method we want. 

plot_section <- function(section="Culture", x = "Year", 
                         y = "Members", data = asasec, 
                         smooth=FALSE, ...){ 
    require(ggplot2) 
    require(splines) 
    # Note use of aes_string() rather than aes() 
    p <- ggplot(subset(data, Sname==section), 
                mapping = aes_string(x=x, y=y)) 

    if(smooth == TRUE) { 
        p0 <- p + geom_smooth(color = "#999999", 
                              size = 1.2, ...) + 
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) + 
            labs(title = section) 
    } else { 
        p0 <- p + geom_line(color= "#E69F00", size=1.2) + 
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) + 
            labs(title = section) 
    } 

    print(p0) 
} 

[Figure A.8 panels: Comm/Urban; Children.] 

plot_section("Comm/Urban", 
             smooth = TRUE, 
             method = "loess") 
plot_section("Children", 
             smooth = TRUE, 
             method = "lm", 
             formula = y ~ ns(x, 2)) 
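
The pass-through behavior of ... can be seen without any plotting at all. The following is a minimal sketch using a hypothetical helper, center(), that is not part of the book's code. It hands any extra named arguments on to base R's mean(), in the same way that plot_section() hands them on to geom_smooth().

```r
# A hypothetical helper: summarize a numeric vector, passing any
# extra named arguments straight through to mean()
center <- function(x, label = "value", ...) {
  m <- mean(x, ...)
  paste(label, "=", round(m, 2))
}

scores <- c(2, 4, 6, NA)

# No extra arguments: mean() uses its own defaults, so the NA propagates
center(scores)

# na.rm is not an argument of center(); it travels through ... to mean()
center(scores, na.rm = TRUE)

# Named arguments of center() itself are matched first; the rest pass through
center(scores, label = "trimmed", na.rm = TRUE, trim = 0.1)
```

The design choice is the same as in plot_section(): the wrapper names only the arguments it cares about and leaves the rest to the function it wraps.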


3 Managing Projects and Files 
RMarkdown and knitr 

Markdown is a loosely standardized way of writing plain text that 
includes information about the formatting of your document. It 
was originally developed by John Gruber, with input from Aaron 
Swartz. The aim was to make a simple format that could incorporate 
some structural information about the document (such as 
headings and subheadings, emphasis, hyperlinks, lists, and footnotes) 
with minimal loss of readability in plain-text form. A plain-text 
format like HTML is much more extensive and well defined 
than Markdown, but Markdown was meant to be simple. Over the 
years, and despite various weaknesses, it has become a de facto 
standard. Text editors and note-taking applications support it, and 
tools exist to convert Markdown not just into HTML (its original 
target output format) but into many other document types as well. 
The most powerful of these is Pandoc, which can get you from 
Markdown to many other formats (and vice versa). Pandoc is what 
powers RStudio’s ability to convert your notes to HTML, Microsoft 
Word, and PDF documents. 


Figure A.8: Our custom function can now pass 
arguments along to fit different smoothers to 
section membership data. 


en.wikipedia.org/wiki/Markdown 


pandoc.org 

rmarkdown.rstudio.com 

yihui.name/knitr 

Chapter 1 encouraged you to take notes and organize your 
analysis using RMarkdown and (behind the scenes) knitr. These 
are R libraries that RStudio makes easy to use. RMarkdown 
extends Markdown by letting you intersperse your notes with 
chunks of R code. Code chunks can have labels and a few options 
that determine how they will behave when the file is processed. 
After writing your notes and your code, you knit the document 
(Xie 2015). That is, you feed your .Rmd file to R, which processes 
the code chunks and produces a new .md file where the code chunks 
have been replaced by their output. You can then turn that Markdown 
file into a more readable PDF or HTML document, or the 
Word document that a journal demands you send them. 

Behind the scenes in RStudio, this is all done using the knitr 
and rmarkdown libraries. The latter provides a render() function 
that takes you from .Rmd to HTML or PDF in a single step. Conversely, 
if you just want to extract the code you’ve written from the 
surrounding text, then you “tangle” the file, which results in an .R 
file. The strength of this approach is that it makes it much easier 
to document your work properly. There is just one file for both the 
data analysis and the writeup. The output of the analysis is created 
on the fly, and the code to do it is embedded in the paper. If you 
need to do multiple but identical (or very similar) analyses of different 
bits of data, RMarkdown and knitr can make generating 
consistent and reliable reports much easier. 
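
Outside RStudio's buttons, the knit, tangle, and render steps can be run from the console. The sketch below is only an illustration: the file name notes.Rmd and its contents are invented, and everything is written to a temporary directory. It uses knitr's knit() and purl() functions and, where Pandoc is available, rmarkdown's render().

```r
library(knitr)

# Write a tiny throwaway .Rmd file (hypothetical contents)
fence <- strrep("`", 3)  # the three backticks that open and close a chunk
rmd_file <- file.path(tempdir(), "notes.Rmd")
writeLines(c(
  "# Notes",
  "",
  "Some text, followed by a code chunk.",
  "",
  paste0(fence, "{r summary-chunk}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_file)

# Knit: run the chunks and write a .md file with their output included
md_file <- knit(rmd_file, output = file.path(tempdir(), "notes.md"),
                quiet = TRUE)

# Tangle: extract just the code into an .R file
r_file <- purl(rmd_file, output = file.path(tempdir(), "notes.R"),
               quiet = TRUE)

# Render: go from .Rmd to HTML in one step (this step needs Pandoc)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document",
                                 quiet = TRUE)
}
```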

Pandoc’s flavor of Markdown is the one used in knitr and RStudio. 
It allows for a wide range of markup and can handle many 
of the nuts and bolts of scholarly writing, such as complex tables, 
citations, bibliographies, references, and mathematics. In addition 
to being able to produce documents in various file formats, it 
can also produce many different kinds of document, from articles 
and handouts to websites and slide decks. RStudio’s RMarkdown 
website has extensive documentation and examples on the ins 
and outs of RMarkdown’s capabilities, including information on 
customizing it if you wish. 

Writing your notes and papers in a plain-text format like this 
has many advantages. It keeps your writing, your code, and your 
results closer together and allows you to use powerful version control 
methods to keep track of your work and your results. Errors 
in data analysis often well up out of the gap that typically exists 
between the procedure used to produce a figure or table in a paper 
and the subsequent use of that output later. In the ordinary way of 
doing things, you have the code for your data analysis in one file, 
the output it produced in another, and the text of your paper in a 
third file. You do the analysis, collect the output, and copy the relevant 
results into your paper, often manually reformatting them on 
the way. Each of these transitions introduces the opportunity for 
error. In particular, it is easy for a table of results to get detached 
from the sequence of steps that produced it. Almost everyone 
who has written a quantitative paper has been confronted with 
the problem of reading an old draft containing results or figures 
that need to be revisited or reproduced (as a result of peer review, 
say) but which lack any information about the circumstances of 
their creation. Academic papers take a long time to get through 
the cycle of writing, review, revision, and publication, even when 
you’re working hard the whole time. It is not uncommon to have 
to return to something you did two years previously in order to 
answer some question or other from a reviewer. You do not want 
to have to do everything over from scratch in order to get the right 
answer. Whatever the challenges of replicating the results of someone 
else’s quantitative analysis, after a fairly short period of time 
authors themselves find it hard to replicate their own work. Bit-rot 
is the term of art in computer science for the seemingly inevitable 
process of decay that overtakes a project just because you left it 
alone on your computer for six months or more. 

For small and medium-sized projects, plain-text approaches 
that rely on RMarkdown documents and the tools described here 
work well. Things become a little more complicated as projects get 
larger. (This is not an intrinsic flaw of plain-text methods, by the 
way. It is true no matter how you choose to organize your project.) 
In general, it is worth trying to keep your notes and analysis in 
a standardized and simple format. The final outputs of projects 
(such as journal articles or books) tend, as they approach completion, 
to descend into a rush of specific fixes and adjustments, all 
running against the ideal of a fully portable, reproducible analysis. 
It is worth trying to minimize the scope of the inevitable final 
scramble. 


Project organization 

Managing projects is a large topic of its own, about which many 
people have strong opinions. Your goal should be to make your 
code and data portable, reproducible, and self-contained. To 
accomplish this, use a project-based approach in RStudio. When 
you start an analysis with some new data, create a new project 
containing the data and the R or RMarkdown code you will be 
working with. It should then be possible, in the ideal case, to move 
that folder to another computer that also has R, RStudio, and any 
required libraries installed, and successfully rerun the contents of 
the project. 

In practice that means two things. First, even though R is an 
object-oriented language, the only “real,” persistent things in your 
project should be the raw data files you start with and the code that 
operates on them. The code is what is real. Your code manipulates 
the data and creates all the objects and outputs you need. It’s possible 
to save objects in R, but in general you should not need to do 
this for everyday analysis. 

Second, your code should not refer to any file locations outside 
of the project folder. The project folder should be the “root” 
or ground floor for the files inside it. This means you should not 
use absolute file paths to save or refer to data or figures. Instead, 
use only relative paths. A relative path will start at the root of the 
project. So, for example, you should not load data with a command 
like this: 

## An absolute file path. Notice the leading '/' that starts 
## at the very top of the computer's file hierarchy. 
my_data <- read_csv("/Users/kjhealy/projects/gss/data/gss.csv") 

Instead, because you have an R project file started in the gss 
folder, you can use the here() function from the here library to 
specify a relative path, like this: 

my_data <- read_csv(here("data", "gss.csv")) 

While you could type the relative paths out yourself, using 
here() has the advantage that it will work if, for example, you 
use Mac OS and you send your project to someone working on 
Windows. The same rule goes for saving your work, as we saw at 
the end of chapter 3, when you save individual plots as PDF or 
PNG files. 
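
The same pattern covers both reading and saving. In the sketch below, the figure file name gss_trend.pdf is a made-up example, and the reading and saving calls are left as comments because they assume the data file and a plot object p already exist.

```r
library(here)

# Build paths from the project root instead of typing separators by hand
data_path <- here("data", "gss.csv")
figure_path <- here("figures", "gss_trend.pdf")  # hypothetical file name

data_path
figure_path

# With the files in place, reading and saving might then look like this:
# my_data <- readr::read_csv(data_path)
# ggplot2::ggsave(figure_path, plot = p, height = 5, width = 8)
```

The printed paths will differ from machine to machine, which is the point: the code stays the same while here() works out where the project root is.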

Within your project folder, a little organization goes a long 
way. You should get in the habit of keeping different parts of the 
project in different subfolders of your working directory (fig. A.9). 
More complex projects may have a more complex structure, but 
you can go a long way with some simple organization. RMarkdown 
files can be in the top level of your working directory, with separate 
subfolders called data/ (for your CSV files), figures/ (for plots 
that you might save), and perhaps docs/ (for information 
about your project or data files). RStudio can help with organization 
as well through its project management features. 

Figure A.9: Folder organization for a simple project. [Image: a scots-turnout project folder containing data, doc, and figures subfolders alongside a turnout.r file.] 

Keeping your project organized will prevent you from ending 
up with huge numbers of files of different kinds all sitting at the 
top of your working directory. 
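
A folder skeleton like this can be created from R itself. The sketch below uses only base R and builds the structure inside a temporary directory so that it does not touch any real files; the project name my-project is a placeholder, and in practice the root would be your RStudio project folder.

```r
# Stand-in for a real project folder (hypothetical name)
project_root <- file.path(tempdir(), "my-project")

# One subfolder per kind of file
subfolders <- c("data", "figures", "docs")
for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Confirm what was created
list.files(project_root)
```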


4 Some Features of This Book 
Preparing the county-level maps 

The U.S. county-level maps in the socviz library were prepared 
using shapefiles from the U.S. Census Bureau that were converted 
to GeoJSON format by Eric Celeste. The code to prepare the 
imported shapefile was written by Bob Rudis and draws on the 
rgdal library to do the heavy lifting of importing the shapefile 
and transforming the projection. Bob’s code extracts the 
(county-identifying) rownames from the imported spatial data frame and 
then moves Alaska and Hawaii to new locations in the bottom left 
of the map area so that we can map all fifty states. 

eric.cist.org/stuff/usGeoJSON 

First we read in the map file, set the projection, and set up 
an identifying variable we can work with later on to merge in 
data. The call to CRS() is a single long line of text conforming 
to a technical GIS specification defining the projection and other 
details that the map is encoded in. Long lines of code are conventionally 
indicated by the backslash character, “\,” when we 
have to artificially break them on the page. Do not type the backslash 
if you write out this code yourself. We assume the mapfile is 
named gz_2010_us_050_00_5m.json and is in the data/geojson 
subfolder of the project directory. 

# You will need to use install.packages() to install 
# these map and GIS libraries if you do not already 
# have them. 
library(maptools) 
library(mapproj) 
library(rgeos) 
library(rgdal) 

us_counties <- readOGR(dsn="data/geojson/gz_2010_us_050_00_5m.json", 
                       layer="OGRGeoJSON") 

us_counties_aea <- spTransform(us_counties, 
                    CRS("+proj=laea +lat_0=45 +lon_0=-100 \ 
                         +x_0=0 +y_0=0 +a=6370997 +b=6370997 \ 
                         +units=m +no_defs")) 

us_counties_aea@data$id <- rownames(us_counties_aea@data) 

With the file imported, we then extract, rotate, shrink, and 
move Alaska, resetting the projection in the process. We also move 
Hawaii. The areas are identified by their state FIPS codes. We 
remove the old states and put the new ones back in, and remove 
Puerto Rico as our examples lack data for this region. If you have 
data for the area, you can move it between Texas and Florida. 

alaska <- us_counties_aea[us_counties_aea$STATE == "02", ] 
alaska <- elide(alaska, rotate = -50) 
alaska <- elide(alaska, scale = max(apply(bbox(alaska), 1, diff)) / 2.3) 
alaska <- elide(alaska, shift = c(-2100000, -2500000)) 
proj4string(alaska) <- proj4string(us_counties_aea) 

hawaii <- us_counties_aea[us_counties_aea$STATE == "15", ] 
hawaii <- elide(hawaii, rotate = -35) 
hawaii <- elide(hawaii, shift = c(5400000, -1400000)) 
proj4string(hawaii) <- proj4string(us_counties_aea) 

us_counties_aea <- us_counties_aea[!us_counties_aea$STATE %in% 
                                   c("02", "15", "72"), ] 
us_counties_aea <- rbind(us_counties_aea, alaska, hawaii) 





Finally, we tidy the spatial object into a data frame that ggplot 
can use and clean up the id label by stripping out a prefix from the 
string. 

county_map <- tidy(us_counties_aea, region = "GEO_ID") 
county_map$id <- stringr::str_replace(county_map$id, 
                                      pattern = "0500000US", replacement = "") 


At this point the county_map object is ready to be merged with 
a table of FIPS-coded U.S. county data using either merge() or 
left_join(). While I show these steps in detail here, they are also 
conveniently wrapped as functions in the tidycensus library. 

For more detail and code for the merge, see 
github.com/kjhealy/us-county 
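
The merge step itself might be sketched as follows. This is not the book's code: both data frames are tiny invented stand-ins (the coordinates and the pop_dens values are made up), and the example assumes the dplyr library is installed. It uses left_join() because that keeps the rows of the map table in their original order, which matters when the rows are polygon vertices to be drawn in sequence.

```r
library(dplyr)

# Toy stand-in for county_map: one row per polygon vertex, plus an id
county_map_toy <- data.frame(
  long = c(0, 1, 1, 5, 6, 6),
  lat = c(0, 0, 1, 5, 5, 6),
  id = c("01001", "01001", "01001", "01003", "01003", "01003"),
  stringsAsFactors = FALSE
)

# Toy county-level table keyed by the same FIPS codes (values invented).
# FIPS codes are kept as character strings so leading zeros survive.
county_data_toy <- data.frame(
  id = c("01001", "01003"),
  pop_dens = c(91.8, 114.6),
  stringsAsFactors = FALSE
)

# Every vertex row picks up the data for its county
county_full <- left_join(county_map_toy, county_data_toy, by = "id")
county_full
```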

This book's plot theme, and its map theme 

The ggplot theme used in this book is derived principally from 
the work (again) of Bob Rudis. His hrbrthemes package provides 
theme_ipsum(), a compact theme that can be used with the 
Arial typeface or, in a variant, the freely available Roboto Condensed 
typeface. The main difference between the theme_book() 
used here and Rudis’s theme_ipsum() is the choice of typeface. The 
hrbrthemes package can be installed from GitHub in the usual 
way: 

devtools::install_github("hrbrmstr/hrbrthemes") 

The book’s theme is also available on GitHub. This package 
does not include the font files themselves. These are available from 
Adobe, which makes the typeface. 

github.com/kjhealy/myriad 

When drawing maps we also used a theme_map() function. 
This theme begins with the built-in theme_bw() and turns off most 
of the guide, scale, and panel content that is not needed when presenting 
a map. It is available through the socviz library. The code 
looks like this: 


theme_map <- function(base_size=9, base_family="") { 
    require(grid) 
    theme_bw(base_size=base_size, base_family=base_family) %+replace% 
        theme(axis.line=element_blank(), 
              axis.text=element_blank(), 
              axis.ticks=element_blank(), 
              axis.title=element_blank(), 
              panel.background=element_blank(), 
              panel.border=element_blank(), 
              panel.grid=element_blank(), 
              panel.spacing=unit(0, "lines"), 
              plot.background=element_blank(), 
              legend.justification = c(0,0), 
              legend.position = c(0,0) 
              ) 
} 


Themes are functions. Creating a theme means writing a function 
with a sequence of instructions about what thematic elements 
to modify, and how. We give it a default base_size argument 
and an empty base_family argument (for the font family). The 
%+replace% operator in the code is new to us. This is a convenience 
operator defined by ggplot and used for updating theme elements 
in bulk. Throughout the book we saw repeated use of the + operator 
to incrementally add to or tweak the content of a theme, as 
when we would do + theme(legend.position = "top"). Using + 
added the instruction to the theme, adjusting whatever was specified 
and leaving everything else as it was. The %+replace% operator 
does something similar but has a stronger effect. We begin with 
theme_bw() and then use a theme() statement to add new content, 
as usual. The %+replace% operator replaces the entire element 
specified, rather than adding to it. Any property of such an element 
that is not set in the theme() statement is dropped rather than 
inherited from the old version of the element, while elements that 
the statement does not mention are left as they were in theme_bw(). 
So this is a way to create themes by starting from existing ones and 
swapping in wholly new definitions for particular elements. See the 
documentation for theme_get() for more details. In the function 
here, you can see each of the thematic elements that are switched 
off using the element_blank() function. 
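
The difference between the two operators is easiest to see by inspecting a single element. The sketch below is an illustration only, assuming ggplot2 is installed; the choice of axis.text and the color "red" are arbitrary. It shows which pieces of the element, and of the rest of the theme, survive under each operator.

```r
library(ggplot2)

base_theme <- theme_bw()

# + merges the new setting into the existing axis.text element
added <- base_theme + theme(axis.text = element_text(colour = "red"))

# %+replace% swaps in the new axis.text element wholesale
replaced <- base_theme %+replace%
  theme(axis.text = element_text(colour = "red"))

# With +, the relative size set by theme_bw() is still there
added$axis.text$size

# With %+replace%, the size was not restated, so it is gone
replaced$axis.text$size

# Elements that the theme() statement did not mention are untouched
is.null(replaced$panel.border)
```

This is why theme_map() can safely blank out whole elements: each element_blank() fully replaces what theme_bw() had defined for that element.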
foreword 


"Power Corrupts. PowerPoint Corrupts Absolutely." 

—Edward Tufte, Yale Professor Emeritus 1 

We've all been victims of bad slideware. Hit-and-run presentations 
that leave us staggering from a maelstrom of fonts, colors, bullets, 
and highlights. Infographics that fail to be informative and are only 
graphic in the same sense that violence can be graphic. Charts and 
tables in the press that mislead and confuse. 

It's too easy today to generate tables, charts, graphs. I can imagine 
some old-timer (maybe it's me?) harrumphing over my shoulder that 
in his day they'd do illustrations by hand, which meant you had to 
think before committing pen to paper. 

Having all the information in the world at our fingertips doesn't make 
it easier to communicate: it makes it harder. The more information 
you're dealing with, the more difficult it is to filter down to the most 
important bits. 

Enter Cole Nussbaumer Knaflic. 

I met Cole in late 2007. I'd been recruited by Google the year before 
to create the "People Operations" team, responsible for finding, keeping, 
and delighting the folks at Google. Shortly after joining I decided 


1 Tufte, Edward R. 'PowerPoint Is Evil.' Wired Magazine, www.wired.com/wired/ 
archive/11.09/ppt2.html, September 2003. 
we needed a People Analytics team, with a mandate to make sure 
we innovated as much on the people side as we did on the product 
side. Cole became an early and critical member of that team, acting 
as a conduit between the Analytics team and other parts of Google. 

Cole always had a knack for clarity. 

She was given some of our messiest messages—such as what exactly 
makes one manager great and another crummy—and distilled them into 
crisp, pleasing imagery that told an irrefutable story. Her messages of 
"don't be a data fashion victim" (i.e., lose the fancy clipart, graphics and 
fonts—focus on the message) and "simple beats sexy" (i.e., the point is 
to clearly tell a story, not to make a pretty chart) were powerful guides. 

We put Cole on the road, teaching her own data visualization course 
over 50 times in the ensuing six years, before she decided to strike 
out on her own on a self-proclaimed mission to "rid the world of bad 
PowerPoint slides." And if you think that's not a big issue, a Google 
search of "powerpoint kills" returns almost half a million hits! 

In Storytelling with Data, Cole has created an of-the-moment 
complement to the work of data visualization pioneers like Edward 
Tufte. She's worked at and with some of the most data-driven 
organizations on the planet as well as some of the most mission-driven, 
data-free institutions. In both cases, she's helped sharpen their 
messages, and their thinking. 

She's written a fun, accessible, and eminently practical guide to 
extracting the signal from the noise, and for making all of us better 
at getting our voices heard. 

And that's kind of the whole point, isn't it? 


Laszlo Bock 

SVP of People Operations, Google, Inc. 

and author of Work Rules! 


May 2015 
Bad graphs are everywhere 

I encounter a lot of less-than-stellar visuals in my work (and in my 
life—once you get a discerning eye for this stuff, it's hard to turn it 
off). Nobody sets out to make a bad graph. But it happens. Again and 
again. At every company throughout all industries and by all types 
of people. It happens in the media. It happens in places where you 
would expect people to know better. Why is that? 

We aren't naturally good at storytelling with data 

In school, we learn a lot about language and math. On the language 
side, we learn how to put words together into sentences and into 
stories. With math, we learn to make sense of numbers. But it's rare 
that these two sides are paired: no one teaches us how to tell stories 
with numbers. Adding to the challenge, very few people feel naturally 
adept in this space. 

This leaves us poorly prepared for an important task that is increasingly 
in demand. Technology has enabled us to amass greater and 
greater amounts of data and there is an accompanying growing 
desire to make sense out of all of this data. Being able to visualize 
data and tell stories with it is key to turning it into information that 
can be used to drive better decision making. 

In the absence of natural skills or training in this space, we often end 
up relying on our tools to understand best practices. Advances in 
technology, in addition to increasing the amount of and access to 
data, have also made tools to work with data pervasive. Pretty much 
anyone can put some data into a graphing application (for example, 
Excel) and create a graph. This is important to consider, so I 
will repeat myself: anyone can put some data into a graphing application 
and create a graph. This is remarkable, considering that the 
process of creating a graph was historically reserved for scientists or 
those in other highly technical roles. And scary, because without a 
clear path to follow, our best intentions and efforts (combined with 
oft-questionable tool defaults) can lead us in some really bad directions: 
3D, meaningless color, pie charts. 

Skilled in Microsoft Office? So is everyone else! 


Being adept with word processing applications, spreadsheets, 
and presentation software (things that used 
to set one apart on a resume and in the workplace) has 
become a minimum expectation for most employers. A 
recruiter told me that, today, having "proficiency in Microsoft 
Office" on a resume isn't enough: a basic level of knowledge 
here is assumed and it's what you can do above and beyond 
that will set you apart from others. Being able to effectively 
tell stories with data is one area that will give you that edge 
and position you for success in nearly any role. 


While technology has increased access to and proficiency in tools 
to work with data, there remain gaps in capabilities. You can put 
some data in Excel and create a graph. For many, the process of 
data visualization ends there. This can render the most interesting 
story completely underwhelming, or worse—difficult or impossible 
to understand. Tool defaults and general practices tend to leave 
our data and the stories we want to tell with that data sorely lacking. 

There is a story in your data. But your tools don't know what that 
story is. That's where it takes you—the analyst or communicator of 
the information—to bring that story visually and contextually to life. 
That process is the focus of this book. The following are a few example 
before-and-afters to give you a visual sense of what you'll learn; 
we'll cover each of these in detail at various points in the book. 

The lessons we will cover will enable you to shift from simply showing 
data to storytelling with data. 

FIGURE 0.2 Example 1 (before): showing data 

FIGURE 0.3 Example 1 (after): storytelling with data. [Slide title: "Please approve the hire of 2 FTEs to backfill those who quit in the past year"; chart: ticket volume over time, 2014. Data source: XYZ Dashboard, as of 12/31/2014.] 

FIGURE 0.4 Example 2 (before): showing data. [Survey results, PRE and POST: "How do you feel about doing science?" Categories: Bored, Not great, OK, Kind of interested, Excited.] 

FIGURE 0.5 Example 2 (after): storytelling with data. [Slide title: "Pilot program was a success"; chart: "How do you feel about science?" before and after the program. Based on survey of 100 students conducted before and after pilot program (100% response rate on both surveys).] 

FIGURE 0.6 Example 3 (before): showing data. [Chart title: "Average Retail Product Price per Year"; Products A to E, 2008 to 2014.] 

FIGURE 0.7 Example 3 (after): storytelling with data. [Slide title: "To be competitive, we recommend introducing our product below the $223 average price point in the $150-$200 range".] 

Who this book is written for 

This book is written for anyone who needs to communicate something 
to someone using data. This includes (but is certainly not limited 
to): analysts sharing the results of their work, students visualizing 
thesis data, managers needing to communicate in a data-driven way, 
philanthropists proving their impact, and leaders informing their 
board. I believe that anyone can improve their ability to communicate 
effectively with data. This is an intimidating space for many, but 
it does not need to be. 

When you are asked to "show data," what sort of feelings does that 
evoke? 

Perhaps you feel uncomfortable because you are unsure where 
to start. Or maybe it feels like an overwhelming task because you 
assume that what you are creating needs to be complicated and 
show enough detail to answer every possible question. Or perhaps 
you already have a solid foundation here, but are looking for that 
something that will help take your graphs and the stories you want 
to tell with them to the next level. In all of these cases, this book is 
written with you in mind. 

Being able to tell stories with data is a skill that's becoming ever 
more important in our world of increasing data and desire for data- 
driven decision making. An effective data visualization can mean 
the difference between success and failure when it comes to communicating 
the findings of your study, raising money for your nonprofit, 
presenting to your board, or simply getting your point across 
to your audience. 

My experience has taught me that most people face a similar challenge: 
they may recognize the need to be able to communicate 
effectively with data but feel like they lack expertise in this space. 
People skilled in data visualization are hard to come by. Part of the 
challenge is that data visualization is a single step in the analytical 
process. Those hired into analytical roles typically have quantitative 
backgrounds that suit them well for the other steps (finding the 
data, pulling it together, analyzing it, building models), but not necessarily 
any formal training in design to help them when it comes to 
the communication of the analysis—which, by the way, is typically 
the only part of the analytical process that your audience ever sees. 
And increasingly, in our ever more data-driven world, those without 
technical backgrounds are being asked to put on analytical hats and 
communicate using data. 

The feelings of discomfort you may experience in this space aren't 
surprising, given that being able to communicate effectively with 
data isn't something that has been traditionally taught. Those who 
excel have typically learned what works and what doesn't through 
trial and error. This can be a long and tedious process. Through this 
book, I hope to help expedite it for you. 


How I learned to tell stories with data 

I have always been drawn to the space where mathematics and 
business intersect. My educational background is mathematics and 
business, which enables me to communicate effectively with both 
sides—given that they don't always speak the same language—and 
help them better understand one another. I love being able to take 
the science of data and use it to inform better business decisions. 
Over time, I've found that one key to success is being able to communicate 
effectively visually with data. 

I initially recognized the importance of being skilled in this area during 
my first job out of college. I was working as an analyst in credit 
risk management (before the subprime crisis and hence before anyone 
really knew what credit risk management was). My job was to 
build and assess statistical models to forecast delinquency and loss. 
This meant taking complicated stuff and ultimately turning it into a 
simple communication of whether we had adequate money in the 
reserves for expected losses, in what scenarios we'd be at risk, and so 
forth. I quickly learned that spending time on the aesthetic piece 
(something my colleagues didn't typically do) meant my work garnered 
more attention from my boss and my boss's boss. For me, that 
was the beginning of seeing value in spending time on the visual 
communication of data. 

After progressing through various roles in credit risk, fraud, and operations
management, followed by some time in the private equity
world, I decided I wanted to continue my career outside of banking
and finance. I paused to reflect on the skills I possessed that I
wanted to be utilizing on a daily basis: at the core, it was using data 
to influence business decisions. 

I landed at Google, on the People Analytics team. Google is a data-driven
company—so much so that they even use data and analytics
in a space not frequently seen: human resources. People Analytics is 
an analytics team embedded in Google's HR organization (referred 
to at Google as "People Operations"). The mantra of this team is 
to help ensure that people decisions at Google—decisions about 
employees or future employees—are data driven. This was an amazing
place to continue to hone my storytelling with data skills, using
data and analytics to better understand and inform decision making
in spaces like targeted hiring, engaging and motivating employees,
building effective teams, and retaining talent. Google People
Analytics is cutting edge, helping to forge a path that many other
companies have started to follow. Being involved in building and
growing this team was an incredible experience. 


Storytelling with data on what makes a great 
manager via Project Oxygen 


One particular project that has been highlighted in
the public sphere is the Project Oxygen research at
Google on what makes a great manager. This work has been
described in the New York Times and is the basis of a popular
Harvard Business Review case study. One challenge
faced was communicating the findings to various audiences,
from engineers who were sometimes skeptical on methodology
and wanted to dig into the details, to managers
wanting to understand the big-picture findings and how to 
put them to use. My involvement in the project was on the 
communication piece, helping to determine how to best 
show sometimes very complicated stuff in a way that would 
appease the engineers and their desire for detail while still 
being understandable and straightforward for managers and 
various levels of leadership. To do this, I leveraged many of 
the concepts we will discuss in this book. 


The big turning point for me happened when we were building an 
internal training program within People Operations at Google and 
I was asked to develop content on data visualization. This gave me 
the opportunity to research and start to learn the principles behind 
effective data visualization, helping me understand why some of the 
things I'd arrived at through trial and error over the years had been 
effective. With this research, I developed a course on data visualization
that was eventually rolled out to all of Google.

The course created some buzz, both inside and outside of Google. 
Through a series of fortuitous events, I received invitations to speak 
at a couple of philanthropic organizations and events on the topic of 
data visualization. Word spread. More and more people were reaching
out to me—initially in the philanthropic world, but increasingly in
the corporate sector as well—looking for guidance on how to
communicate effectively with data. It was becoming increasingly clear
that the need in this space was not unique to Google. Rather, pretty 
much anyone in an organization or business setting could increase 
their impact by being able to communicate effectively with data. 
After acting as a speaker at conferences and organizations in my 
spare time, eventually I left Google to pursue my emerging goal of 
teaching the world how to tell stories with data. 

Over the past few years, I've taught workshops for more than a hundred
organizations in the United States and Europe. It's been interesting
to see that the need for skills in this space spans many industries
and roles. I've had audiences in consulting, consumer products, education,
financial services, government, health care, nonprofit, retail,
startups, and technology. My audiences have been a mix of roles and 
levels: from analysts who work with data on a daily basis to those in 
non-analytical roles who occasionally have to incorporate data into 
their work, to managers needing to provide guidance and feedback, 
to the executive team delivering quarterly results to the board. 

Through this work, I've been exposed to many diverse data visualization
challenges. I have come to realize that the skills that are needed
in this area are fundamental. They are not specific to any industry 
or role, and they can be effectively taught and learned—as demonstrated
by the consistent positive feedback and follow-ups I receive
from workshop attendees. Over time, I've codified the lessons that 
I teach in my workshops. These are the lessons I will share with you. 


How you'll learn to tell stories with data: 6 lessons 

In my workshops, I typically focus on five key lessons. The big opportunity
with this book is that there isn't a time limit (in the way there
is in a workshop setting). I've included a sixth bonus lesson that I've 
always wanted to share ("think like a designer") and also a lot more 
by way of before-and-after examples, step-by-step instruction, and 
insight into my thought process when it comes to the visual design 
of information.

I will give you practical guidance that you can begin using immediately
to better communicate visually with data. We'll cover content
to help you learn and be comfortable employing six key lessons: 

1. Understand the context 

2. Choose an appropriate visual display 

3. Eliminate clutter 

4. Focus attention where you want it 

5. Think like a designer 

6. Tell a story 

Illustrative examples span many industries 

Throughout the book, I use a number of case studies to illustrate the 
concepts discussed. The lessons we cover will not be industry- or
role-specific, but rather will focus on fundamental concepts and
best practices for effective communication with data. Because my 
work spans many industries, so do the examples upon which I draw. 
You will see case studies from technology, education, consumer 
products, the nonprofit sector, and more. 

Each example used is based on a lesson I have taught in my workshops,
but in many cases I've slightly changed the data or generalized
the situation to protect confidential information.

For any example that doesn't initially seem relevant to you, I encourage
you to pause and think about what data visualization or communication
challenges you encounter where a similar approach could
be effective. There is something to be learned from every example,
even if the example itself isn't obviously related to the world in
which you work. 


Lessons are not tool specific

The lessons we will cover in this book focus on best practices that 
can be applied in any graphing application or presentation software. 
There are a vast number of tools that can be leveraged to tell effective
stories with data. No matter how great the tool, however, it will
never know your data and your story like you do. Take the time to
learn your tool well so that it does not become a limiting factor when
it comes to applying the lessons we'll cover throughout this book. 


How do you do that in Excel? 


While I will not focus the discussion on specific tools,
the examples in this book were created using
Microsoft Excel. For those interested in a closer look at how
similar visuals can be built in Excel, please visit my blog at 
storytellingwithdata.com, where you can download the Excel 
files that accompany my posts. 


How this book is organized 

This book is organized into a series of big-picture lessons, with each 
chapter focusing on a single core lesson and related concepts. We 
will discuss a bit of theory when it will aid in understanding, but I 
will emphasize the practical application of the theory, often through 
specific, real-world examples. You will leave each chapter ready to 
apply the given lesson. 

The lessons in the book are organized chronologically in the same 
way that I think about the storytelling with data process. Because of 
this and because later chapters do build on and in some cases refer 
back to earlier content, I recommend reading from beginning to 
end. After you've done this, you'll likely find yourself referring back 
to specific points of interest or examples that are relevant to the current
data visualization challenges you face.

To give you a more specific idea of the path we'll take, chapter summaries
can be found below.


Chapter 1: the importance of context 

Before you start down the path of data visualization, there are a 
couple of questions that you should be able to concisely answer: 
Who is your audience? What do you need them to know or do? This 
chapter describes the importance of understanding the situational 
context, including the audience, communication mechanism, and 
desired tone. A number of concepts are introduced and illustrated 
via example to help ensure that context is fully understood. Creating 
a robust understanding of the situational context reduces iterations 
down the road and sets you on the path to success when it comes 
to creating visual content. 


Chapter 2: choosing an effective visual 

What is the best way to show the data you want to communicate? 
I've analyzed the visual displays I use most in my work. In this chapter,
I introduce the most common types of visuals used to communicate
data in a business setting, discuss appropriate use cases for
each, and illustrate each through real-world examples. Specific types
of visuals covered include simple text, table, heatmap, line graph, 
slopegraph, vertical bar chart, vertical stacked bar chart, waterfall 
chart, horizontal bar chart, horizontal stacked bar chart, and square 
area graph. We also cover visuals to be avoided, including pie and 
donut charts, and discuss reasons for avoiding 3D. 


Chapter 3: clutter is your enemy! 

Picture a blank page or a blank screen: every single element you 
add to that page or screen takes up cognitive load on the part of 
your audience. That means we should take a discerning eye to the 
elements we allow on our page or screen and work to identify those 
things that are taking up brain power unnecessarily and remove
them. Identifying and eliminating clutter is the focus of this chapter.
As part of this conversation, I introduce and discuss the Gestalt
Principles of Visual Perception and how we can apply them to visual 
displays of information such as tables and graphs. We also discuss 
alignment, strategic use of white space, and contrast as important 
components of thoughtful design. Several examples are used to 
illustrate the lessons. 


Chapter 4: focus your audience's attention 

In this chapter, we continue to examine how people see and how you 
can use that to your advantage when crafting visuals. This includes 
a brief discussion on sight and memory that will act to frame up the 
importance of preattentive attributes like size, color, and position 
on page. We explore how preattentive attributes can be used strategically
to help direct your audience's attention to where you want
them to focus and to create a visual hierarchy of components to help
direct your audience through the information you want to communicate
in the way you want them to process it. Color as a strategic
tool is covered in depth. Concepts are illustrated through a number
of examples.


Chapter 5: think like a designer 

Form follows function. This adage of product design has clear application
to communicating with data. When it comes to the form and
function of our data visualizations, we first want to think about what it 
is we want our audience to be able to do with the data (function) and 
create a visualization (form) that will allow for this with ease. In this 
chapter, we discuss how traditional design concepts can be applied 
to communicating with data. We explore affordances, accessibility, 
and aesthetics, drawing upon a number of concepts introduced previously,
but looking at them through a slightly different lens. We also
discuss strategies for gaining audience acceptance of your visual 
designs.


Chapter 6: dissecting model visuals

Much can be learned from a thorough examination of effective visual 
displays. In this chapter, we look at five exemplary visuals and discuss
the specific thought process and design choices that led to their
creation, utilizing the lessons covered up to this point. We explore 
decisions regarding the type of graph and ordering of data within 
the visual. We consider choices around what and how to emphasize
and de-emphasize through use of color, thickness of lines, and
relative size. We discuss alignment and positioning of components 
within the visuals and also the effective use of words to title, label, 
and annotate. 


Chapter 7: lessons in storytelling 

Stories resonate and stick with us in ways that data alone cannot. In 
this chapter, I introduce concepts of storytelling that can be leveraged
for communicating with data. We consider what can be learned
from master storytellers. A story has a clear beginning, middle, and 
end; we discuss how this framework applies to and can be used when 
constructing business presentations. We cover strategies for effective 
storytelling, including the power of repetition, narrative flow, considerations
with spoken and written narratives, and various tactics to
ensure that our story comes across clearly in our communications. 


Chapter 8: pulling it all together 

Previous chapters included piecemeal applications to demonstrate 
individual lessons covered. In this comprehensive chapter, we follow 
the storytelling with data process from start to finish using a single 
real-world example. We understand the context, choose an appropriate
visual display, identify and eliminate clutter, draw attention
to where we want our audience to focus, think like a designer, and 
tell a story. Together, these lessons and resulting visuals and narrative
illustrate how we can move from simply showing data to telling
a story with data.


Chapter 9: case studies

The penultimate chapter explores specific strategies for tackling 
common challenges faced in communicating with data through a 
number of case studies. Topics covered include color considerations 
with a dark background, leveraging animation in the visuals you present
versus those you circulate, establishing logic in order, strategies
for avoiding the spaghetti graph, and alternatives to pie charts. 


Chapter 10: final thoughts 

Data visualization—and communicating with data in general—sits 
at the intersection of science and art. There is certainly some science
to it: best practices and guidelines to follow. There is also an
artistic component. Apply the lessons we've covered to forge your 
path, using your artistic license to make the information easier for 
your audience to understand. In this final chapter, we discuss tips on 
where to go from here and strategies for upskilling storytelling with 
data competency in your team and your organization. We end with 
a recap of the main lessons covered. 

Collectively, the lessons we'll cover will enable you to tell stories with 
data. Let's get started! 



chapter one 


the importance of context


This may sound counterintuitive, but success in data visualization 
does not start with data visualization. Rather, before you begin down 
the path of creating a data visualization or communication, attention
and time should be paid to understanding the context for the
need to communicate. In this chapter, we will focus on understanding
the important components of context and discuss some strategies
to help set you up for success when it comes to communicating
visually with data. 


Exploratory vs. explanatory analysis 

Before we get into the specifics of context, there is one important 
distinction to draw, between exploratory and explanatory analysis. 
Exploratory analysis is what you do to understand the data and figure 
out what might be noteworthy or interesting to highlight to others. 
When we do exploratory analysis, it's like hunting for pearls in oysters.
We might have to open 100 oysters (test 100 different hypotheses
or look at the data in 100 different ways) to find perhaps two pearls. 
When we're at the point of communicating our analysis to our audience,
we really want to be in the explanatory space, meaning you
have a specific thing you want to explain, a specific story you want 
to tell—probably about those two pearls. 

Too often, people err and think it's OK to show exploratory analysis 
(simply present the data, all 100 oysters) when they should be showing
explanatory (taking the time to turn the data into information
that can be consumed by an audience: the two pearls). It is an understandable
mistake. After undertaking an entire analysis, it can be
tempting to want to show your audience everything, as evidence of
all of the work you did and the robustness of the analysis. Resist this
urge. You are making your audience reopen all of the oysters! Concentrate
on the pearls, the information your audience needs to know.

Here, we focus on explanatory analysis and communication. 


Recommended reading 


For those interested in learning more about exploratory
analysis, check out Nathan Yau's book, Data Points. Yau 
focuses on data visualization as a medium, rather than a tool, 
and spends a good portion of the book discussing the data 
itself and strategies for exploring and analyzing it. 


Who, what, and how 

When it comes to explanatory analysis, there are a few things to think 
about and be extremely clear on before visualizing any data or creating
content. First, To whom are you communicating? It is important
to have a good understanding of who your audience is and how they 
perceive you. This can help you to identify common ground that will
help you ensure they hear your message. Second, What do you want
your audience to know or do? You should be clear how you want your 
audience to act and take into account how you will communicate to 
them and the overall tone that you want to set for your communication. 

It's only after you can concisely answer these first two questions that 
you're ready to move forward with the third: How can you use data 
to help make your point? 

Let's look at the context of who, what, and how in a little more detail. 


Who 


Your audience 

The more specific you can be about who your audience is, the better 
position you will be in for successful communication. Avoid general 
audiences, such as "internal and external stakeholders" or "anyone 
who might be interested"—by trying to communicate to too many 
different people with disparate needs at once, you put yourself in a 
position where you can't communicate to any one of them as effectively
as you could if you narrowed your target audience. Sometimes
this means creating different communications for different audiences.
Identifying the decision maker is one way of narrowing your
audience. The more you know about your audience, the better positioned
you'll be to understand how to resonate with them and form
a communication that will meet their needs and yours. 


You 

It's also helpful to think about the relationship that you have with 
your audience and how you expect that they will perceive you. Will 
you be encountering each other for the first time through this communication,
or do you have an established relationship? Do they
already trust you as an expert, or do you need to work to establish 
credibility? These are important considerations when it comes to
determining how to structure your communication and whether and
when to use data, and may impact the order and flow of the overall 
story you aim to tell. 


Recommended reading 


In Nancy Duarte's book Resonate, she recommends thinking
of your audience as the hero and outlines specific strategies 
for getting to know your audience, segmenting your 
audience, and creating common ground. A free multimedia 
version of Resonate is available at duarte.com. 


What 

Action 

What do you need your audience to know or do? This is the point 
where you think through how to make what you communicate relevant
for your audience and form a clear understanding of why
they should care about what you say. You should always want your 
audience to know or do something. If you can't concisely articulate 
that, you should revisit whether you need to communicate in the 
first place. 

This can be an uncomfortable space for many. Often, this discomfort
seems to be driven by the belief that the audience knows better
than the presenter and therefore should choose whether and how 
to act on the information presented. This assumption is false. If you 
are the one analyzing and communicating the data, you likely know 
it best—you are a subject matter expert. This puts you in a unique
position to interpret the data and help lead people to understanding
and action. In general, those communicating with data need to take
a more confident stance when it comes to making specific observations
and recommendations based on their analysis. This will feel
outside of your comfort zone if you haven't been routinely doing it.
Start doing it now—it will get easier with time. And know that even
if you highlight or recommend the wrong thing, it prompts the right 
sort of conversation focused on action. 

When it really isn't appropriate to recommend an action explicitly,
encourage discussion toward one. Suggesting possible next
steps can be a great way to get the conversation going because it 
gives your audience something to react to rather than starting with 
a blank slate. If you simply present data, it's easy for your audience 
to say, "Oh, that's interesting," and move on to the next thing. But 
if you ask for action, your audience has to make a decision whether 
to comply or not. This elicits a more productive reaction from your 
audience, which can lead to a more productive conversation—one 
that might never have been started if you hadn't recommended the 
action in the first place. 


Prompting action 


Here are some action words to help act as thought starters
as you determine what you are asking of your audience: 

accept | agree | begin | believe | change | collaborate | commence
| create | defend | desire | differentiate | do | empathize |
empower | encourage | engage | establish | examine | facilitate |
familiarize | form | implement | include | influence | invest |
invigorate | know | learn | like | persuade | plan | promote |
pursue | recommend | receive | remember | report | respond |
secure | support | simplify | start | try | understand | validate


Mechanism 

How will you communicate to your audience? The method you will 
use to communicate to your audience has implications on a number 
of factors, including the amount of control you will have over how 
the audience takes in the information and the level of detail that
needs to be explicit. We can think of the communication mechanism
along a continuum, with live presentation at the left and a written 
document or email at the right, as shown in Figure 1.1. Consider the 
level of control you have over how the information is consumed as 
well as the amount of detail needed at either end of the spectrum.


[Figure: continuum from live presentation at the left to written document or email at the right]



FIGURE 1.1 Communication mechanism continuum 


At the left, with a live presentation, you (the presenter) are in full 
control. You determine what the audience sees and when they see 
it. You can respond to visual cues to speed up, slow down, or go into 
a particular point in more or less detail. Not all of the detail needs 
to be directly in the communication (the presentation or slide deck), 
because you, the subject matter expert, are there to answer any 
questions that arise over the course of the presentation and should 
be able and prepared to do so irrespective of whether that detail is 
in the presentation itself.


For live presentations, practice makes perfect


Do not use your slides as your teleprompter! If you find
yourself reading each slide out loud during a presentation,
you are using them as one. This creates a painful audience
experience. You have to know your content to give a
good presentation and this means practice, practice, and 
more practice! Keep your slides sparse, and only put things 
on them that help reinforce what you will say. Your slides can 
remind you of the next topic, but shouldn't act as your speaking
notes.

Here are a few tips for getting comfortable with your material 
as you prepare for your presentation: 

• Write out speaking notes with the important points you 
want to make with each slide. 

• Practice what you want to say out loud to yourself: this 
ignites a different part of the brain to help you remember 
your talking points. It also forces you to articulate the transitions
between slides that sometimes trip up presenters.

• Give a mock presentation to a friend or colleague. 


At the right side of the spectrum, with a written document or email, 
you (the creator of the document or email) have less control. In this 
case, the audience is in control of how they consume the information. 
The level of detail that is needed here is typically higher because 
you aren't there to see and respond to your audience's cues. Rather, 
the document will need to directly address more of the potential 
questions. 

In an ideal world, the work product for the two sides of this continuum
would be totally different—sparse slides for a live presentation
(since you're there to explain anything in more detail as needed), and
denser documents when the audience is left to consume on their
own. But in reality—due to time and other constraints—it is often 
the same product that is created to try to meet both of these needs. 
This gives rise to the slideument, a single document that's meant to 
solve both of these needs. This poses some challenges because of 
the diverse needs it is meant to satisfy, but we'll look at strategies 
for addressing and overcoming these challenges later in the book. 

At this point at the onset of the communication process, it is important 
to identify the primary communication vehicle you'll be leveraging: 
live presentation, written document, or something else. Considerations
on how much control you'll have over how your audience consumes
the information and the level of detail needed will become
very important once you start to generate content. 


Tone 

What tone do you want your communication to set? Another important
consideration is the tone you want your communication to convey
to your audience. Are you celebrating a success? Trying to light a
fire to drive action? Is the topic lighthearted or serious? The tone you
desire for your communication will have implications on the design 
choices that we will discuss in future chapters. For now, think about 
and specify the general tone that you want to establish when you 
set out on the data visualization path. 


How 

Finally—and only after we can clearly articulate who our audience 
is and what we need them to know or do—we can turn to the data 
and ask the question: What data is available that will help make my 
point? Data becomes supporting evidence of the story you will build 
and tell. We'll discuss much more on how to present this data visually
in subsequent chapters.


Ignore the nonsupporting data?


You might assume that showing only the data that backs
up your point and ignoring the rest will make for a stronger
case. I do not recommend this. Beyond being misleading
by painting a one-sided story, this is very risky. A discerning
audience will poke holes in a story that doesn't hold up
or data that shows one aspect but ignores the rest. The right 
amount of context and supporting and opposing data will 
vary depending on the situation, the level of trust you have 
with your audience, and other factors. 


Who, what, and how: illustrated by example 

Let's consider a specific example to illustrate these concepts. Imagine 
you are a fourth grade science teacher. You just wrapped up an experimental
pilot summer learning program on science that was aimed
at giving kids exposure to the unpopular subject. You surveyed the
children at the onset and end of the program to understand whether
and how perceptions toward science changed. You believe the data 
shows a great success story. You would like to continue to offer the 
summer learning program on science going forward. 

Let's start with the who by identifying our audience. There are a number
of different potential audiences who might be interested in this
information: parents of students who participated in the program,
parents of prospective future participants, the future potential participants
themselves, other teachers who might be interested in
doing something similar, or the budget committee that controls the 
funding you need to continue the program. You can imagine how 
the story you would tell to each of these audiences might differ. The 
emphasis might change. The call to action would be different for the 
different groups. The data you would show (or the decision to show 
data at all) could be different for the various audiences. You can 
imagine how, if we crafted a single communication meant to address
all of these disparate audiences' needs, it would likely not exactly
meet any single audience's need. This illustrates the importance of 
identifying a specific audience and crafting a communication with 
that specific audience in mind. 

Let's assume in this case the audience we want to communicate to 
is the budget committee, which controls the funding we need to 
continue the program. 

Now that we have answered the question of who, the what becomes 
easier to identify and articulate. If we're addressing the budget committee,
a likely focus would be to demonstrate the success of the
program and ask for a specific funding amount to continue to offer it. 
After identifying who our audience is and what we need from them, 
next we can think about the data we have available that will act as 
evidence of the story we want to tell. We can leverage the data collected
via survey at the onset and end of the program to illustrate
the increase in positive perceptions of science before and after the 
pilot summer learning program. 

This won't be the last time we'll consider this example. Let's recap 
who we have identified as our audience, what we need them to know 
and do, and the data that will help us make our case: 

Who: The budget committee that can approve funding for con¬ 
tinuation of the summer learning program. 

What: The summer learning program on science was a success; 
please approve budget of $X to continue. 

How: Illustrate success with data collected through the survey 
conducted before and after the pilot program. 


Consulting for context: questions to ask 

Often, the communication or deliverable you are creating is at the 
request of someone else: a client, a stakeholder, or your boss. This 
means you may not have all of the context and might need to consult 
with the requester to fully understand the situation. There is sometimes 
additional context in the head of this requester that they may 
assume is known or not think to say out loud. Following are some 
questions you can use as you work to tease out this information. If 
you're on the requesting side of the communication and asking your 
support team to build a communication, think about answering these 
questions for them up front: 

• What background information is relevant or essential? 

• Who is the audience or decision maker? What do we know about 
them? 

• What biases does our audience have that might make them sup¬ 
portive of or resistant to our message? 

• What data is available that would strengthen our case? Is our audi¬ 
ence familiar with this data, or is it new? 

• Where are the risks: what factors could weaken our case and do 
we need to proactively address them? 

• What would a successful outcome look like? 

• If you only had a limited amount of time or a single sentence to 
tell your audience what they need to know, what would you say? 

In particular, I find that these last two questions can lead to insight¬ 
ful conversation. Knowing what the desired outcome is before you 
start preparing the communication is critical for structuring it well. 
Putting a significant constraint on the message (a short amount of 
time or a single sentence) can help you to boil the overall com¬ 
munication down to the single, most important message. To that 
end, there are a couple of concepts I recommend knowing and 
employing: the 3-minute story and the Big Idea. 


The 3-minute story & Big Idea 

The idea behind each of these concepts is that you are able to boil 
the "so-what" down to a paragraph and, ultimately, to a single, 
concise statement. You have to really know your stuff—know what 
the most important pieces are as well as what isn't essential in the 
most stripped-down version. While it sounds easy, being concise 
is often more challenging than being verbose. Mathematician and 
philosopher Blaise Pascal recognized this in his native French, with a 
statement that translates roughly to "I would have written a shorter 
letter, but I did not have the time" (a sentiment often attributed to 
Mark Twain). 


3-minute story 

The 3-minute story is exactly that: if you had only three minutes 
to tell your audience what they need to know, what would you 
say? This is a great way to ensure you are clear on and can articu¬ 
late the story you want to tell. Being able to do this removes you 
from dependence on your slides or visuals for a presentation. This 
is useful in the situation where your boss asks you what you're 
working on or if you find yourself in an elevator with one of your 
stakeholders and want to give her the quick rundown. Or if your 
half-hour on the agenda gets shortened to ten minutes, or to five. 
If you know exactly what it is you want to communicate, you can 
make it fit the time slot you're given, even if it isn't the one for 
which you are prepared. 


Big Idea 

The Big Idea boils the so-what down even further: to a single sen¬ 
tence. This is a concept that Nancy Duarte discusses in her book, 
Resonate (2010). She says the Big Idea has three components: 

1. It must articulate your unique point of view; 

2. It must convey what's at stake; and 

3. It must be a complete sentence. 

Let's consider an illustrative 3-minute story and Big Idea, leveraging 
the summer learning program on science example that was intro¬ 
duced previously. 


3-minute story: A group of us in the science department were 
brainstorming about how to resolve an ongoing issue we have with 
incoming fourth-graders. It seems that when kids get to their first 
science class, they come in with this attitude that it's going to be 
difficult and they aren't going to like it. It takes a good amount of 
time at the beginning of the school year to get beyond that. So we 
thought, what if we try to give kids exposure to science sooner? 
Can we influence their perception? We piloted a learning pro¬ 
gram last summer aimed at doing just that. We invited elementary 
school students and ended up with a large group of second- and 
third-graders. Our goal was to give them earlier exposure to sci¬ 
ence in hopes of forming positive perception. To test whether we 
were successful, we surveyed the students before and after the 
program. We found that, going into the program, the biggest 
segment of students, 40%, felt just "OK" about science, whereas 
after the program, most of these shifted into positive perceptions, 
with nearly 70% of total students expressing some level of inter¬ 
est toward science. We feel that this demonstrates the success of 
the program and that we should not only continue to offer it, but 
also to expand our reach with it going forward. 

Big Idea: The pilot summer learning program was successful at 
improving students' perceptions of science and, because of this 
success, we recommend continuing to offer it going forward; 
please approve our budget for this program. 

When you've articulated your story this clearly and concisely, creat¬ 
ing content for your communication becomes much easier. Let's shift 
gears now and discuss a specific strategy when it comes to planning 
content: storyboarding. 


Storyboarding 

Storyboarding is perhaps the single most important thing you can 
do up front to ensure the communication you craft is on point. The 
storyboard establishes a structure for your communication. It is a 
visual outline of the content you plan to create. It can be subject to 
change as you work through the details, but establishing a structure 
early on will set you up for success. When you can (and as makes 
sense), get acceptance from your client or stakeholder at this step. 
It will help ensure that what you're planning is in line with the need. 

When it comes to storyboarding, the biggest piece of advice I have 
is this: don't start with presentation software. It is too easy to go 
into slide-generating mode without thinking about how the pieces 
fit together and end up with a massive presentation deck that says 
nothing effectively. Additionally, as we start creating content via our 
computer, something happens that causes us to form an attachment 
to it. This attachment can be such that, even if we know what we've 
created isn't exactly on the mark or should be changed or eliminated, 
we are sometimes resistant to doing so because of the work we've 
already put in to get it to where it is. 

Avoid this unnecessary attachment (and work!) by starting low tech. 
Use a whiteboard, Post-it notes, or plain paper. It's much easier to put 
a line through an idea on a piece of paper or recycle a Post-it note 
without feeling the same sense of loss as when you cut something 
you've spent time creating with your computer. I like using Post-it 
notes when I storyboard because you can rearrange (and add and 
remove) the pieces easily to explore different narrative flows. 

If we storyboard our communication for the summer learning pro¬ 
gram on science, it might look something like Figure 1.2. 

Note that in this example storyboard, the Big Idea is at the end, in 
the recommendation. Perhaps we'd want to consider leading with 
that to ensure that our audience doesn't miss the main point and to 
help set up why we are communicating to them and why they should 
care in the first place. We'll discuss additional considerations related 
to the narrative order and flow in Chapter 7. 


FIGURE 1.2 Example storyboard 


In closing 

When it comes to explanatory analysis, being able to concisely artic¬ 
ulate exactly who you want to communicate to and what you want 
to convey before you start to build content reduces iterations and 
helps ensure that the communication you build meets the intended 
purpose. Understanding and employing concepts like the 3-minute 
story, the Big Idea, and storyboarding will enable you to clearly and 
succinctly tell your story and identify the desired flow. 

While pausing before actually building the communication might feel 
like it's a step that slows you down, in fact it helps ensure that you 
have a solid understanding of what you want to do before you start 
creating content, which will save you time down the road. 

With that, consider your first lesson learned. You now understand 
the importance of context. 










chapter two 


choosing an effective visual 


There are many different graphs and other types of visual displays of 
information, but a handful will work for the majority of your needs. 
When I look back over the 150+ visuals that I created for workshops 
and consulting projects in the past year, there were only a dozen dif¬ 
ferent types of visuals that I used (Figure 2.1). These are the visuals 
we'll focus on in this chapter. 

[Figure panels include: simple text, table, heatmap, scatterplot, slopegraph, vertical bar, horizontal bar, stacked vertical bar, stacked horizontal bar, waterfall, square area] 

FIGURE 2.1 The visuals I use most 


Simple text 

When you have just a number or two to share, simple text can be a 
great way to communicate. Think about solely using the number— 
making it as prominent as possible—and a few supporting words 
to clearly make your point. Beyond potentially being misleading, 
putting one or only a couple of numbers in a table or graph simply 
causes the numbers to lose some of their oomph. When you have 
a number or two that you want to communicate, think about using 
the numbers themselves. 

To illustrate this concept, let's consider the following example. A 
graph similar to Figure 2.2 accompanied an April 2014 Pew Research 
Center report on stay-at-home moms. 


[Bar chart: "Children with a 'Traditional' Stay-at-Home Mother"; % of children with a married stay-at-home mother with a working husband; bars for 1970 (41) and 2012] 

Note: Based on children younger than 18. Their mothers are categorized 
based on employment status in 1970 and 2012. 

Source: Pew Research Center analysis of March Current Population Surveys 
Integrated Public Use Microdata Series (IPUMS-CPS), 1971 and 2013 

Adapted from PEW RESEARCH CENTER 

FIGURE 2.2 Stay-at-home moms original graph 

The fact that you have some numbers does not mean that you need 
a graph! In Figure 2.2, quite a lot of text and space are used for a 
grand total of two numbers. The graph doesn't do much to aid in 
the interpretation of the numbers (and with the data labels 
positioned outside of the bars, it can even skew your perception of 
relative height, so that the fact that 20 is less than half of 41 
doesn't really come across visually). 

In this case, a simple sentence would suffice: 20% of children had 
a traditional stay-at-home mom in 2012, compared to 41% in 1970. 

Alternatively, in a presentation or report, your visual could look 
something like Figure 2.3. 



20% of children had a traditional stay-at-home mom 
in 2012, compared to 41% in 1970 

FIGURE 2.3 Stay-at-home moms simple text makeover 

As a side note, one consideration in this specific example might be 
whether you want to show an entirely different metric. For example, 
you could reframe in terms of the percent change: "The number of 
children having a traditional stay-at-home mom decreased more 
than 50% between 1970 and 2012." I advise caution, however, any 
time you reduce from multiple numbers down to a single one—think 
about what context may be lost in doing so. In this case, I find that 
the actual magnitude of the numbers (20% and 41%) is helpful in 
interpreting and understanding the change. 

When you have just a number or two that you want to communicate: 
use the numbers directly. 
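
The percent-change reframing mentioned in the side note can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; the function name `percent_change` is a hypothetical helper introduced only for illustration, and the two inputs are the figures quoted in the text (41% in 1970, 20% in 2012).

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage of old."""
    return (new - old) / old * 100


share_1970 = 41  # % of children with a traditional stay-at-home mom, 1970
share_2012 = 20  # same metric, 2012

# One summary number: the relative change.
relative = percent_change(share_1970, share_2012)

# The difference between the two shares, in percentage points.
absolute = share_2012 - share_1970

print(f"Relative change: {relative:.1f}%")
print(f"Absolute change: {absolute} percentage points")
```

The single relative figure hides both starting points, which is the context the side note warns may be lost when two numbers are reduced to one.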

When you have more data that you want to show, generally a table 
or graph is the way to go. One thing to understand is that people 
interact differently with these two types of visuals. Let's discuss each 
in detail and look at some specific varieties and use cases. 


Tables 

Tables interact with our verbal system, which means that we read 
them. When I have a table in front of me, I typically have my index 
finger out: I'm reading across rows and down columns or I'm com¬ 
paring values. Tables are great for just that—communicating to a 
mixed audience whose members will each look for their particular 
row of interest. If you need to communicate multiple different units 
of measure, this is also typically easier with a table than a graph. 


Tables in live presentations 


Using a table in a live presentation is rarely a good idea. 
As your audience reads it, you lose their ears and attention 
to make your point verbally. When you find yourself 
using a table in a presentation or report, ask yourself: what 
is the point you are trying to make? Odds are that there will 
be a better way to pull out and visualize the piece or pieces 
of interest. In the event that you feel you're losing too much 
by doing this, consider whether including the full table in the 
appendix and a link or reference to it will meet your audi¬ 
ence's needs. 


One thing to keep in mind with a table is that you want the design to 
fade into the background, letting the data take center stage. Don't 
let heavy borders or shading compete for attention. Instead, think 
of using light borders or simply white space to set apart elements 
of the table. 

Take a look at the example tables in Figure 2.4. As you do, note how 
the data stands out more than the structural components of the table 
in the second and third iterations (light borders, minimal borders). 


[Three versions of the same table: heavy borders, light borders, minimal borders] 

Group      Metric A   Metric B   Metric C 
Group 1    $X.X       Y%         Z,ZZZ 
Group 2    $X.X       Y%         Z,ZZZ 
Group 3    $X.X       Y%         Z,ZZZ 
Group 4    $X.X       Y%         Z,ZZZ 
Group 5    $X.X       Y%         Z,ZZZ 

FIGURE 2.4 Table borders 


Borders should be used to improve the legibility of your table. Think 
about pushing them to the background by making them grey, or 
getting rid of them altogether. The data should be what stands out, 
not the borders. 


Recommended reading 


For more on table design, check out Stephen Few's book, 
Show Me the Numbers. There is an entire chapter dedi¬ 
cated to the design of tables, with discussion on the struc¬ 
tural components of tables and best practices in table 
design. 


Next, let's shift our focus to a special case of tables: the heatmap. 


Heatmap 

One approach for mixing the detail you can include in a table while 
also making use of visual cues is via a heatmap. A heatmap is a way 
to visualize data in tabular format, where in place of (or in addition 
to) the numbers, you leverage colored cells that convey the relative 
magnitude of the numbers. 

Consider Figure 2.5, which shows some generic data in a table and 
also a heatmap. 


[Left panel: "Table". Right panel: "Heatmap", with a LOW-HIGH color legend and the same values shaded by magnitude.] 

             A     B     C 
Category 1   15%   22%   42% 
Category 2   40%   36%   20% 
Category 3   35%   17%   34% 
Category 4   30%   29%   26% 
Category 5   55%   30%   58% 
Category 6   11%   25%   49% 

FIGURE 2.5 Two views of the same data 





In the table in Figure 2.5, you are left to read the data. I find myself 
scanning across rows and down columns to get a sense of what I'm 
looking at, where numbers are higher or lower, and mentally stack 
rank the categories presented in the table. 

To reduce this mental processing, we can use color saturation to 
provide visual cues, helping our eyes and brains more quickly target 
the potential points of interest. In the second iteration of the table, 
on the right and entitled "Heatmap," the higher the saturation of blue, 
the higher the number. This makes the process of picking out the tails 
of the spectrum—the lowest number (11%) and highest number 
(58%)—an easier and faster process than it was in the original table 
where we didn't have any visual cues to help direct our attention. 
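
To make the idea of color saturation concrete, here is a minimal sketch of one way a heatmap could translate numbers into shading. It assumes a simple linear (min-max) scaling, which is an illustrative choice and not necessarily what any particular graphing application does; the helper name `saturation_levels` is hypothetical. The values are the ones shown in the table in Figure 2.5.

```python
# Percentages from the table in Figure 2.5 (columns A, B, C).
table = {
    "Category 1": [15, 22, 42],
    "Category 2": [40, 36, 20],
    "Category 3": [35, 17, 34],
    "Category 4": [30, 29, 26],
    "Category 5": [55, 30, 58],
    "Category 6": [11, 25, 49],
}


def saturation_levels(rows):
    """Scale every value to the range 0..1 (0 = table minimum, 1 = maximum).

    Assumption: a linear scale across the whole table.
    """
    values = [v for row in rows.values() for v in row]
    low, high = min(values), max(values)
    span = high - low
    return {
        name: [(v - low) / span if span else 0.0 for v in row]
        for name, row in rows.items()
    }


levels = saturation_levels(table)
for name, row in levels.items():
    # A level near 1 would be drawn as the most saturated cell.
    print(name, " ".join(f"{level:.2f}" for level in row))
```

With this scaling, the lowest number (11%) gets no saturation and the highest (58%) gets full saturation, which is what lets the eye find the tails of the spectrum quickly.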

Graphing applications (like Excel) typically have conditional formatting 
functionality built in that allows you to apply formatting like 
that shown in Figure 2.5 with ease. Be sure when you leverage this 
to always include a legend to help the reader interpret the data (in 
this case, the LOW-HIGH subtitle on the heatmap with color corresponding 
to the conditional formatting color serves this purpose). 

Next, let's shift our discussion to the visuals we tend to think of first 
when it comes to communicating with data: graphs. 


Graphs 

While tables interact with our verbal system, graphs interact with 
our visual system, which is faster at processing information. This 
means that a well-designed graph will typically get the information 
across more quickly than a well-designed table. As I mentioned at 
the onset of this chapter, there are a plethora of graph types out 
there. The good news is that a handful of them will meet most of 
your everyday needs. 

The types of graphs I frequently use fall into four categories: points, 
lines, bars, and area. We will examine these more closely and discuss 
the subtypes that I find myself using on a regular basis, with specific 
use cases and examples for each. 


Chart or graph? 


Some draw a distinction between charts and graphs. 
Typically, "chart" is the broader category, with "graphs" 
being one of the subtypes (other chart types include maps 
and diagrams). I don't tend to draw this distinction, since 
nearly all of the charts I deal with on a regular basis are 
graphs. Throughout this book, I use the words chart and 
graph interchangeably. 


Points 


Scatterplot 

Scatterplots can be useful for showing the relationship between two 
things, because they allow you to encode data simultaneously on a 
horizontal x-axis and vertical y-axis to see whether and what relation¬ 
ship exists. They tend to be more frequently used in scientific fields 
(and perhaps, because of this, are sometimes viewed as complicated 
to understand by those less familiar with them). Though infrequent, 
there are use cases for scatterplots in the business world as well. 

For example, let's say that we manage a bus fleet and want to under¬ 
stand the relationship between miles driven and cost per mile. The 
scatterplot may look something like Figure 2.6. 


[Scatterplot: "Cost per mile by miles driven"; x-axis: miles driven per month; y-axis: cost per mile] 

FIGURE 2.6 Scatterplot 

If we want to focus primarily on those cases where cost per mile is 
above average, a slightly modified scatterplot designed to draw our 
eye there more quickly might look something like what is shown in 
Figure 2.7. 

[Scatterplot: "Cost per mile by miles driven"] 

FIGURE 2.7 Modified scatterplot 



We can use Figure 2.7 to make observations such as cost per mile is 
higher than average when less than about 1,700 miles or more than 
about 3,300 miles were driven for the sample observed. We'll talk 
more about the design choices made here and reasons for them in 
upcoming chapters. 
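
The emphasis described for the modified scatterplot amounts to splitting the points by whether they sit above the average. The sketch below shows that split under stated assumptions: the observations are invented for illustration (they are not the bus fleet data behind the figures), and `split_by_average` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical observations: (miles driven per month, cost per mile in dollars).
observations = [
    (1200, 2.40),
    (1800, 1.50),
    (2500, 1.20),
    (3000, 1.40),
    (3600, 2.10),
]

average_cost = mean(cost for _, cost in observations)


def split_by_average(points, threshold):
    """Separate points whose cost is above the threshold from the rest."""
    above = [p for p in points if p[1] > threshold]
    at_or_below = [p for p in points if p[1] <= threshold]
    return above, at_or_below


above, at_or_below = split_by_average(observations, average_cost)

# Points in `above` would be drawn in an attention-grabbing color,
# points in `at_or_below` in a muted one.
print("Emphasize:", above)
print("De-emphasize:", at_or_below)
```

Any plotting tool can then draw the two groups with different colors so the above-average cases stand out.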


Lines 

Line graphs are most commonly used to plot continuous data. 
Because the points are physically connected via the line, it implies 
a connection between the points that may not make sense for cat¬ 
egorical data (a set of data that is sorted or divided into different 
categories). Often, our continuous data is in some unit of time: days, 
months, quarters, or years. 

Within the line graph category, there are two types of charts that I frequently 
find myself using: the standard line graph and the slopegraph. 


Line graph 

The line graph can show a single series of data, two series of data, 
or multiple series, as illustrated in Figure 2.8. 



FIGURE 2.8 Line graphs 


Note that when you're graphing time on the horizontal x-axis of a 
line graph, the data plotted must be in consistent intervals. I recently 
saw a graph where the units on the x-axis were decades from 1900 
forward (1910, 1920, 1930, etc.) and then switched to yearly after 2010 
(2011, 2012, 2013, 2014). This meant that the distance between the 
decade points and annual points looked the same. This is a mislead¬ 
ing way to show the data. Be consistent in the time points you plot. 
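
A quick way to catch the inconsistent-interval problem described above is to check the gaps between consecutive time points before plotting. This is a minimal sketch; the function name `has_consistent_intervals` is hypothetical, and the second list is a shortened stand-in for the decade-then-yearly axis described in the text.

```python
def has_consistent_intervals(time_points):
    """Return True when every gap between consecutive time points is equal."""
    gaps = {later - earlier for earlier, later in zip(time_points, time_points[1:])}
    return len(gaps) <= 1


decades = [1900, 1910, 1920, 1930]
mixed = [1990, 2000, 2010, 2011, 2012, 2013, 2014]  # decades, then single years

print(has_consistent_intervals(decades))
print(has_consistent_intervals(mixed))
```

If the check fails, either resample the data to a single interval or space the points on the axis according to their true distance in time.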


Showing average within a range in a line graph 


In some cases, the line in your line graph may represent a 
summary statistic, like the average, or the point estimate of 
a forecast. If you also want to give a sense of the range (or 
confidence level, depending on the situation), you can do 
that directly on the graph by also visualizing this range. For 
example, the graph in Figure 2.9 shows the minimum, aver¬ 
age, and maximum wait times at passport control for an air¬ 
port over a 13-month period. 
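
The three series behind a graph of this kind (minimum, average, maximum) can be derived from the raw observations for each period. The sketch below assumes invented wait times in minutes; they are not the data behind Figure 2.9, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical wait times in minutes, grouped by month.
wait_times = {
    "Sep": [12, 18, 25, 31],
    "Oct": [10, 16, 22],
    "Nov": [14, 20, 29, 33],
}


def summarize(samples):
    """Minimum, average, and maximum of one period's observations."""
    return {"min": min(samples), "avg": mean(samples), "max": max(samples)}


summary = {month: summarize(samples) for month, samples in wait_times.items()}

for month, stats in summary.items():
    print(month, stats)
```

The average would be drawn as the line, with the minimum and maximum bounding a shaded range around it.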


[Line graph: "Passport control wait time, past 13 months"; x-axis: Sep 2014 through Sep 2015] 

FIGURE 2.9 Showing average within a range in a line graph 


Slopegraph 

Slopegraphs can be useful when you have two time periods or 
points of comparison and want to quickly show relative increases 
and decreases or differences across various categories between the 
two data points. 

The best way to explain the value of and use case for slopegraphs 
is through a specific example. Imagine that you are analyzing and 
communicating data from a recent employee feedback survey. To 
show the relative change in survey categories from 2014 to 2015, the 
slopegraph might look something like Figure 2.10. 

Slopegraphs pack in a lot of information. In addition to the absolute 
values (the points), the lines that connect them give you the visual 
increase or decrease in rate of change (via the slope or direction) 
without ever having to explain that's what they are doing, or what 
exactly a "rate of change" is; rather, it's intuitive. 

[Slopegraph: "Employee feedback over time"; survey categories: Peers, Culture, Work environment, Leadership, Career development, Rewards & recognition, Perf management; x-axis: survey year, 2014 and 2015] 

FIGURE 2.10 Slopegraph 


Slopegraph template 


Slopegraphs can take a bit of patience to set up because 
they often aren't one of the standard graphs included in 
graphing applications. An Excel template with an example 
slopegraph and instructions for customized use can 
be downloaded here: storytellingwithdata.com/slopegraph-template. 


Whether a slopegraph will work in your specific situation depends 
on the data itself. If many of the lines are overlapping, a slopegraph 
may not work, though in some cases you can still emphasize a single 
series at a time with success. For example, we can draw attention 
to the single category that decreased over time from the preceding 
example. 


[Slopegraph: "Employee feedback over time," with the same survey categories and survey years as Figure 2.10] 

FIGURE 2.11 Modified slopegraph 


In Figure 2.11, our attention is drawn immediately to the decrease 
in "Career development," while the rest of the data is preserved for 
context without competing for attention. We will talk about the strat¬ 
egy behind this when we discuss preattentive attributes in Chapter 4. 
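
Deciding which slopegraph series to emphasize can start from the change between the two periods. The following is a rough sketch under stated assumptions: the category names come from the figure, but every number is invented for illustration, and `changes` is a hypothetical helper name.

```python
# Hypothetical favorable-response percentages as (2014, 2015) pairs.
# Category names follow the figure; the values are made up.
survey = {
    "Peers": (85, 91),
    "Culture": (80, 96),
    "Work environment": (70, 75),
    "Leadership": (59, 62),
    "Career development": (49, 33),
    "Rewards & recognition": (41, 45),
    "Perf management": (33, 42),
}


def changes(data):
    """Change in percentage points between the two survey years."""
    return {name: after - before for name, (before, after) in data.items()}


deltas = changes(survey)

# Categories whose line slopes downward are candidates for emphasis.
decreased = [name for name, delta in deltas.items() if delta < 0]

print(deltas)
print("Emphasize:", decreased)
```

The emphasized series would be drawn in a bold color and the remaining lines in grey, so the context stays visible without competing for attention.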

While lines work well to show data over time, bars tend to be my 
go-to graph type for plotting categorical data, where information is 
organized into groups. 


Bars 

Sometimes bar charts are avoided because they are common. This is 
a mistake. Rather, bar charts should be leveraged because they are 
common, as this means less of a learning curve for your audience. 
Instead of using their brain power to try to understand how to read 
the graph, your audience spends it figuring out what information to 
take away from the visual. 

Bar charts are easy for our eyes to read. Our eyes compare the end 
points of the bars, so it is easy to see quickly which category is the 
biggest, which is the smallest, and also the incremental difference 
between categories. Note that, because of how our eyes compare 
the relative end points of the bars, it is important that bar charts 
always have a zero baseline (where the x-axis crosses the y-axis at 
zero); otherwise, you get a false visual comparison. 

Consider Figure 2.12 from Fox News. 



FIGURE 2.12 Fox News bar chart 


For this example, let's imagine we are back in the fall of 2012. We 
are wondering what will happen if the Bush tax cuts expire. On the 
left-hand side, we have what the top tax rate is currently, 35%, and 
on the right-hand side what it will be as of January 1, at 39.6%. 

When you look at this graph, how does it make you feel about the 
potential expiration of the tax cuts? Perhaps worried about the huge 
increase? Let's take a closer look. 

Note that the bottom number on the vertical axis (shown at the far 
right) is not zero, but rather 34. This means that the bars, in theory, 
should continue down through the bottom of the page. In fact, the way 
this is graphed, the visual increase is 460% (the heights of the bars are 
35 - 34 = 1 and 39.6 - 34 = 5.6, so (5.6 - 1) / 1 = 460%). If we graph the 
bars with a zero baseline so that the heights are accurately represented 
(35 and 39.6), we get an actual visual increase of 13% ((39.6 - 35) / 35). 
Let's look at a side-by-side comparison in Figure 2.13. 
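
The arithmetic above can be reproduced for any pair of values and any axis starting point. This is a minimal sketch; the function name `visual_increase` and its `baseline` parameter are hypothetical names introduced for illustration, and the inputs are the tax rates quoted in the text.

```python
def visual_increase(first, second, baseline=0.0):
    """Percent increase in drawn bar height when the axis starts at `baseline`."""
    first_height = first - baseline
    second_height = second - baseline
    return (second_height - first_height) / first_height * 100


# Axis starting at 34, as in the original graph.
truncated = visual_increase(35, 39.6, baseline=34)

# Axis starting at zero, so bar heights match the actual values.
honest = visual_increase(35, 39.6)

print(f"Non-zero baseline: {truncated:.0f}% visual increase")
print(f"Zero baseline: {honest:.0f}% visual increase")
```

The same underlying change of 4.6 percentage points looks dramatically different depending only on where the axis begins.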


[Two bar charts titled "IF BUSH TAX CUTS EXPIRE: TOP TAX RATE," comparing NOW with JAN. 1, 2013 (39.6%). Left panel, "Non-zero baseline: as originally graphed": y-axis from 34% to 42%. Right panel, "Zero baseline: as it should be graphed": y-axis from 0% to 40%.] 

FIGURE 2.13 Bar charts must have a zero baseline 


In Figure 2.13, what looked like a huge increase on the left is reduced 
considerably when plotted appropriately. Perhaps the tax increase 
isn't so worrisome, or at least not as severe as originally depicted. 
Because of the way our eyes compare the relative end points of the 
bars, it's important to have the context of the entire bar there in order 
to make an accurate comparison. 

You'll note that a couple of other design changes were made in the 
remake of this visual as well. The y-axis labels that were placed on 
the right-hand side of the original visual were moved to the left (so 
we see how to interpret the data before we get to the actual data). 
The data labels that were originally outside of the bars were pulled 
inside to reduce clutter. If I were plotting this data outside of this spe¬ 
cific lesson, I might omit the y-axis entirely and show only the data 
labels within the bars to reduce redundant information. However, in 
this case, I preserved the axis to make it clear that it begins at zero. 


Graph axis vs. data labels 


When graphing data, a common decision to make is 
whether to preserve the axis labels or eliminate the 
axis and instead label the data points directly. In making this 
decision, consider the level of specificity needed. If you want 
your audience to focus on big-picture trends, think about 
preserving the axis but deemphasizing it by making it grey. 

If the specific numerical values are important, it may be bet¬ 
ter to label the data points directly. In this latter case, it's usu¬ 
ally best to omit the axis to avoid the inclusion of redundant 
information. Always consider how you want your audience to 
use the visual and construct it accordingly. 


The rule we've illustrated here is that bar charts must have a zero 
baseline. Note that this rule does not apply to line graphs. With line 
graphs, since the focus is on the relative position in space (rather 
than the length from the baseline or axis), you can get away with a 
nonzero baseline. Still, you should approach with caution—make it 
clear to your audience that you are using a nonzero baseline and 
take context into account so you don't overzoom and make minor 
changes or differences appear significant. 


Ethics and data visualization 


But what if changing the scale on a bar chart or otherwise 
manipulating the data better reinforces the point 
you want to make? Misleading in this manner by inaccurately 
visualizing data is not OK. Beyond ethical concerns, it is risky 
territory. All it takes is one discerning audience member to 
notice the issue (for example, the y-axis of a bar chart begin¬ 
ning at something other than zero) and your entire argument 
will be thrown out the window, along with your credibility. 


While we're considering lengths of bars, let's also spend a moment 
on the width of bars. There's no hard-and-fast rule here, but in gen¬ 
eral the bars should be wider than the white space between the bars. 
You don't want the bars to be so wide, however, that your audience 
wants to compare areas instead of lengths. Consider the following 
"Goldilocks" of bar charts: too thin, too thick, and just right. 


[Three bar charts with categories A through E, labeled "Too thin," "Too thick," and "Just right"] 

FIGURE 2.14 Bar width 


We've discussed some best practices when it comes to bar charts 
in general. Next let's take a look at some different varieties. Having 
a number of bar charts at your disposal gives you flexibility when 
facing different data visualization challenges. We'll look at the ones 
I think you should be familiar with here. 


Vertical bar chart 

The plain vanilla bar chart is the vertical bar chart, or column chart.
Like line graphs, vertical bar charts can be single series, two series, or
multiple series. Note that as you add more series of data, it becomes
more difficult to focus on one at a time and pull out insight, so use
multiple series bar charts with caution. Be aware also that there is
visual grouping that happens as a result of the spacing in bar charts
having more than one data series. This makes the relative order of
the categorization important. Consider what you want your audience
to be able to compare, and structure your categorization hierarchy
to make that as easy as possible.

FIGURE 2.15 Bar charts (panels: Single series, Two series, Multiple series)


Stacked vertical bar chart 

Use cases for stacked vertical bar charts are more limited. They are 
meant to allow you to compare totals across categories and also see 
the subcomponent pieces within a given category. This can quickly 
become visually overwhelming, however—especially given the varied
default color schemes in most graphing applications (more to
come on that). It is hard to compare the subcomponents across the
various categories once you get beyond the bottom series (the one
directly next to the x-axis) because you no longer have a consistent
baseline to use to compare. This makes it a harder comparison for 
our eyes to make, as illustrated in Figure 2.16. 


FIGURE 2.16 Comparing series with stacked bar charts (annotations: "Comparing these is easy," "Comparing these is hard")


The stacked vertical bar chart can be structured as absolute numbers
(where you plot the numbers directly, as shown in Figure 2.16),
or with each column summing to 100% (where you plot the percent 
of total for each vertical segment; we'll look at a specific example 
of this in Chapter 9). Which you choose depends on what you are 
trying to communicate to your audience. When you use the 100% 
stacked bar, think about whether it makes sense to also include the 
absolute numbers for each category total (either in an unobtrusive 
way in the graph directly, or possibly in a footnote), which may aid 
in the interpretation of the data. 
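
The conversion from absolute numbers to a 100% stacked column is simple arithmetic, and a short sketch may make it concrete. The Python below is only an illustration: the function name percent_of_total and the category and series values are hypothetical, not data from any figure in this chapter. It returns each column's total alongside the percentages, so the total is on hand if you decide to show it.

```python
def percent_of_total(segments):
    """Return (total, {segment: percent of total}) for one stacked column."""
    total = sum(segments.values())
    if total <= 0:
        raise ValueError("column total must be positive")
    shares = {name: 100.0 * value / total for name, value in segments.items()}
    return total, shares


# Hypothetical data: two categories, three series each.
columns = {
    "Cat 1": {"Series 1": 20, "Series 2": 30, "Series 3": 50},
    "Cat 2": {"Series 1": 10, "Series 2": 10, "Series 3": 20},
}

for category, segments in columns.items():
    total, shares = percent_of_total(segments)
    # Keeping the total lets you print it as an unobtrusive label or footnote.
    print(f"{category} (total = {total}): {shares}")
```

In this made-up data, Series 3 is 50% of both columns even though the underlying totals (100 and 40) differ, which is the kind of context the absolute numbers can supply.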


Waterfall chart 

The waterfall chart can be used to pull apart the pieces of a stacked 
bar chart to focus on one at a time, or to show a starting point, 
increases and decreases, and the resulting ending point. 

The best way to illustrate the use case for a waterfall chart is through a 
specific example. Imagine that you are an HR business partner and want 
to understand and communicate how employee headcount has changed 
over the past year for the client group you support. 

A waterfall chart showing this breakdown might look something like
Figure 2.17.


2014 Headcount math

Though more employees transferred out of the team than transferred in,
aggressive hiring means overall headcount (HC) increased 16% over the course of the year.



FIGURE 2.17 Waterfall chart 

On the left-hand side, we see what the employee headcount for the 
given team was at the beginning of the year. As we move to the right, 
first we encounter the incremental additions: new hires and employees
transferring into the team from other parts of the organization.
This is followed by the deductions: transfers out of the team to other 
parts of the organization and attrition. The final column represents 
employee headcount at the end of the year, after the additions and 
deductions have been applied to the beginning of year headcount. 


Brute-force waterfall charts 


If your graphing application doesn't have waterfall chart
functionality built in, fret not. The secret is to leverage the
stacked bar chart and make the first series (the one that
appears closest to the x-axis) invisible. It takes a bit of math
to set up correctly, but it works great. A blog post on this
topic, along with an example Excel version of the above
chart and instructions on how to set one up for your own
purposes, can be downloaded at
storytellingwithdata.com/waterfall-chart.
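
The "bit of math" behind the invisible series can be sketched in a few lines of Python. This is one possible way to set it up, not the method from the blog post or the Excel file. The function name waterfall_series is hypothetical, and the headcount numbers are invented so that they agree with the figure's subtitle (more transfers out than in, a 16% net increase); they are not the figure's actual data. The sketch also assumes the running total never drops below zero.

```python
def waterfall_series(start, changes):
    """Build the two stacked series behind a brute-force waterfall chart.

    start:   the opening total (first full-height bar)
    changes: list of (label, signed change) pairs, in display order

    Returns (labels, invisible_base, visible_height). Plot invisible_base
    as the first stacked series with no fill, and visible_height on top.
    """
    labels = ["Start"]
    invisible_base = [0.0]
    visible_height = [float(start)]
    running = float(start)

    for label, delta in changes:
        new_running = running + delta
        labels.append(label)
        # The floating bar spans the lower and higher of the two running totals.
        invisible_base.append(min(running, new_running))
        visible_height.append(abs(delta))
        running = new_running

    labels.append("End")
    invisible_base.append(0.0)
    visible_height.append(running)
    return labels, invisible_base, visible_height


# Hypothetical headcount numbers, for illustration only.
labels, base, height = waterfall_series(
    100,
    [("New hires", 30), ("Transfers in", 8), ("Transfers out", -12), ("Exits", -10)],
)
for row in zip(labels, base, height):
    print(row)
```

For an increase, the invisible series sits at the previous running total; for a decrease, it sits at the new running total, so the visible piece hangs down from where the last bar ended.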


Horizontal bar chart 

If I had to pick a single go-to graph for categorical data, it would be 
the horizontal bar chart, which flips the vertical version on its side. 
Why? Because it is extremely easy to read. The horizontal bar chart is
especially useful if your category names are long, as the text is written
from left to right, as most audiences read, making your graph
legible for your audience. Also, because of the way we typically process
information—starting at top left and making z's with our eyes
across the screen or page—the structure of the horizontal bar chart 
is such that our eyes hit the category names before the actual data. 
This means by the time we get to the data, we already know what it 
represents (instead of the darting back and forth our eyes do between 
the data and category names with vertical bar charts). 

Like the vertical bar chart, the horizontal bar chart can be single 
series, two series, or multiple series (Figure 2.18). 


FIGURE 2.18 Horizontal bar charts


The logical ordering of categories


When designing any graph showing categorical data,
be thoughtful about how your categories are ordered.
If there is a natural ordering to your categories, it may make 
sense to leverage that. For example, if your categories are 
age groups—0-10 years old, 11-20 years old, and so on— 
keep the categories in numerical order. If, however, there 
isn't a natural ordering in your categories that makes sense to 
leverage, think about what ordering of your data will make the 
most sense. Being thoughtful here can mean providing a 
construct for your audience, easing the interpretation process. 

Your audience (without other visual cues) will typically look 
at your visual starting at the top left and zigzagging in "z" 
shapes. This means they will encounter the top of your graph 
first. If the biggest category is the most important, think 
about putting that first and ordering the rest of the categories
in decreasing numerical order. Or if the smallest is most
important, put that at the top and order by ascending data 
values. 

For a specific example about the logical ordering of data, 
check out case study 3 in Chapter 9. 
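
The ordering advice above can be sketched in code. This is a rough Python illustration under assumed names: order_categories is a hypothetical helper, and the category labels and values are made up. If a natural order is supplied, it is used as is; otherwise the categories are sorted by value, largest first by default.

```python
def order_categories(values, natural_order=None, descending=True):
    """Return (category, value) pairs in the order they should be plotted.

    values:        dict mapping category name to its value
    natural_order: optional list of names with an inherent order
                   (for example, age groups); used as-is when given
    descending:    when sorting by value, put the largest first
    """
    if natural_order is not None:
        return [(name, values[name]) for name in natural_order]
    return sorted(values.items(), key=lambda item: item[1], reverse=descending)


# Hypothetical categories with no inherent order: sort by value.
by_size = order_categories({"Cat 1": 12, "Cat 2": 30, "Cat 3": 7})
print(by_size)

# Hypothetical age groups: keep the numerical order of the groups themselves.
ages = {"11-20": 40, "0-10": 25, "21-30": 35}
by_age = order_categories(ages, natural_order=["0-10", "11-20", "21-30"])
print(by_age)
```

Passing descending=False covers the case where the smallest category is the most important and should appear at the top.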


Stacked horizontal bar chart 

Similar to the stacked vertical bar chart, stacked horizontal bar charts
can be used to show the totals across different categories but also 
give a sense of the subcomponent pieces. They can be structured 
to show either absolute values or sum to 100%. 

I find this latter approach can work well for visualizing portions of a 
whole on a scale from negative to positive, because you get a consistent
baseline on both the far left and the far right, allowing for easy
comparison of the left-most pieces as well as the right-most pieces.
For example, this approach can work well for visualizing survey data 
collected along a Likert scale (a scale commonly used in surveys that 
typically ranges from Strongly Disagree to Strongly Agree), as shown 
in Figure 2.19. 


Survey results (legend: Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree; axis: Percent of total)



FIGURE 2.19 100% stacked horizontal bar chart 


Area 

I avoid most area graphs. Humans' eyes don't do a great job of attributing
quantitative value to two-dimensional space, which can render
area graphs harder to read than some of the other types of visual
displays we've discussed. For this reason, I typically avoid them, with
one exception—when I need to visualize numbers of vastly different 
magnitudes. The second dimension you get using a square for this 
(which has both height and width, compared to a bar that has only 
height or width) allows this to be done in a more compact way than 
possible with a single dimension, as shown in Figure 2.20. 

Interview breakdown

Out of every 100 phone screens... we bring 25 candidates onsite for interviews... and extend 9 offers.

FIGURE 2.20 Square area graph 
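
The compactness argument comes down to a square root: if area encodes the value, the side of each square grows with the square root of that value. Here is a small Python sketch using the three numbers from Figure 2.20 (100, 25, and 9). The function name square_side and the units_per_area parameter are hypothetical, introduced only for this illustration.

```python
import math


def square_side(value, units_per_area=1.0):
    """Side length of a square whose area encodes `value`.

    units_per_area is how many data units one unit of drawn area stands for.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    return math.sqrt(value / units_per_area)


# The three stages shown in Figure 2.20.
stages = {"phone screens": 100, "onsite interviews": 25, "offers": 9}

for name, count in stages.items():
    print(f"{name}: bar length {count}, square side {square_side(count):g}")
```

Drawn as bars, the lengths would be in the proportion 100 : 25 : 9. Drawn as squares, the sides are 10, 5, and 3, so the smallest number remains visible next to the largest without the largest dominating the page.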

Other types of graphs 

What I've covered up to this point are the types of graphs I find 
myself commonly using. This is certainly not an exhaustive list. However,
they should meet the majority of your everyday needs. Mastering
the basics is imperative before exploring novel types of data
visualization. 

There are many other types of graphs out there. When it comes to 
selecting a graph, first and foremost, choose a graph type that will 
enable you to clearly get your message across to your audience. 
With less familiar types of visuals, you will likely need to take extra 
care in making them accessible and understandable. 



Infographics 


Infographic is a term that is frequently misused. An infographic
is simply a graphical representation of information
or data. Visuals coined infographic run the gamut from
fluffy to informative. On the inadequate end of the spectrum,
they often include elements like garish, oversized numbers
and cartoonish graphics. These designs have a certain visual
appeal and can seduce the reader. On second glance, however,
they appear shallow and leave a discerning audience
dissatisfied. Here, the description of "information graphic"— 
though often used—is not appropriate. On the other end of 
the spectrum are infographics that live up to their name and 
actually inform. There are many good examples in the area 
of data journalism (for example, the New York Times and 
National Geographic). 

There are critical questions information designers must be 
able to answer before they begin the design process. These 
are the same questions we've discussed when it comes to 
understanding the context for storytelling with data. Who is 
your audience? What do you need them to know or do? It is 
only after the answers to these questions can be succinctly 
articulated that an effective method of display that will best 
aid the message can be chosen. Good data visualization— 
infographic or otherwise—is not simply a collection of facts 
on a given topic; good data visualization tells a story. 


To be avoided 

We've discussed the visuals that I use most commonly to communicate
data in a business setting. There are also some specific graph
types and elements that you should avoid: pie charts, donut charts, 
3D, and secondary y-axes. Let's discuss each of these. 


Pie charts are evil 

I have a well-documented disdain for pie charts. In short, they are 
evil. To understand how I arrived at this conclusion, let's look at an 
example. 

The pie chart shown in Figure 2.21 (based on a real example) shows
market share across four suppliers: A, B, C, and D. If I asked you to 
make a simple observation—which supplier is the largest based on 
this visual—what would you say? 


FIGURE 2.21 Pie chart (title: Supplier Market Share; legend: Supplier A, Supplier B, Supplier C, Supplier D)


Most people will agree that "Supplier B," rendered in medium blue 
at the bottom right, appears to be the largest. If you had to estimate 
what proportion Supplier B makes up of the overall market, what percent
might you estimate?

35%? 

40%? 

Perhaps you can tell by my leading questioning that something fishy 
is going on here. Take a look at what happens when we add the numbers
to the pie segments, as shown in Figure 2.22.


FIGURE 2.22 Pie chart with labeled segments (title: Supplier Market Share; legend: Supplier A, Supplier B, Supplier C, Supplier D)


"Supplier B"—which looks largest, at 31%—is actually smaller than 
"Supplier A" above it, which looks smaller. 

Let's discuss a couple of issues that pose a challenge for accurately 
interpreting this data. The first thing that catches your eye (and suspicion,
if you're a discerning chart reader) is the 3D and strange perspective
that's been applied to the graph, tilting the pie and making
the pieces at the top appear farther away and thus smaller than they
actually are, while the pieces at the bottom appear closer and thus
bigger than they actually are. We'll talk more about 3D soon, but for
now I'll articulate a relevant data visualization rule: don't use 3D! It 
does nothing good, and can actually do a whole lot of harm, as we 
see here with the way it skews the visual perception of the numbers. 

Even when we strip away the 3D and flatten the pie, interpretation 
challenges remain. The human eye isn't good at ascribing quantitative
value to two-dimensional space. Said more simply: pie charts
are hard for people to read. When segments are close in size, it's
difficult (if not impossible) to tell which is bigger. When they aren't 
close in size, the best you can do is determine that one is bigger 
than the other, but you can't judge by how much. To get over this, 
you can add data labels as has been done here. But I'd still argue 
the visual isn't worth the space it takes up. 

What should you do instead? One approach is to replace the pie
chart with a horizontal bar chart, as illustrated in Figure 2.23, organized
from greatest to least or vice versa (unless there is some natural
ordering to the categories that makes sense to leverage, as
mentioned earlier). Remember, with bar charts, our eyes compare
the end points. Because they are aligned at a common baseline, it is
easy to assess relative size. This makes it straightforward to see not
only which segment is the largest, for example, but also how incrementally
larger it is than the other segments.

Supplier Market Share 



FIGURE 2.23 An alternative to the pie chart 
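
To get a feel for the common-baseline effect without any charting software, here is a rough text-only Python sketch. The helper text_bar_chart is hypothetical, and so are most of the numbers: only Supplier B's 31% and the fact that Supplier A is larger come from the example above, while the remaining shares are invented so that the pieces sum to 100%.

```python
def text_bar_chart(shares, width=40):
    """Render percent shares as horizontal text bars, largest first.

    Every bar starts at the same column, so the end points can be
    compared directly. A final line reports what the pieces sum to.
    """
    rows = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    label_width = max(len(name) for name, _ in rows)
    scale = width / max(value for _, value in rows)

    lines = []
    for name, value in rows:
        bar = "#" * round(value * scale)
        lines.append(f"{name:<{label_width}} | {bar} {value:.0f}%")
    lines.append(f"{'Total':<{label_width}} | {sum(shares.values()):.0f}%")
    return lines


# Hypothetical shares; only B = 31% and A > B are taken from the text.
shares = {"Supplier A": 34, "Supplier B": 31, "Supplier C": 9, "Supplier D": 26}
lines = text_bar_chart(shares)
print("\n".join(lines))
```

Even in plain text, it is easy to see which supplier is largest and roughly how much larger it is than the next, and the closing total line is one way to hint that the bars are parts of a whole.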

One might argue that you lose something in the transition from pie 
to bar. The unique thing you get with a pie chart is the concept of 
there being a whole and, thus, parts of a whole. But if the visual is 
difficult to read, is it worth it? In Figure 2.23, I've tried to address 
this by showing that the pieces sum to 100%. It isn't a perfect solution,
but something to consider. For more alternatives to pie charts,
check out case study 5 in Chapter 9. 

If you find yourself using a pie chart, pause and ask yourself: why? 
If you're able to answer this question, you've probably put enough 
thought into it to use the pie chart, but it certainly shouldn't be the 
first type of graph that you reach for, given some of the difficulties 
in visual interpretation we've discussed here. 

While we're on the topic of pie charts, let's look quickly at another
"dessert visual" to avoid: the donut chart.

The donut chart 



FIGURE 2.24 Donut chart 

With pies, we are asking our audience to compare angles and areas. 
With a donut chart, we are asking our audience to compare one arc 
length to another arc length (for example, in Figure 2.24, the length 
of arc A compared to arc B). How confident do you feel in your eyes' 
ability to ascribe quantitative value to an arc length? 

Not very? That's what I thought. Don't use donut charts. 

Never use 3D 

One of the golden rules of data visualization goes like this: never use 
3D. Repeat after me: never use 3D. The only exception is if you are 
actually plotting a third dimension (and even then, things get really 
tricky really quickly, so take care when doing this)—and you should 
never use 3D to plot a single dimension. As we saw in the pie chart 
example previously, 3D skews our numbers, making them difficult 
or impossible to interpret or compare. 

Adding 3D to graphs introduces unnecessary chart elements like 
side and floor panels. Even worse than these distractions, graphing 
applications do some pretty strange things when it comes to plotting
values in 3D. For example, in a 3D bar chart, you might think that your
graphing application plots the front of the bar or perhaps the back of 
the bar. Unfortunately, it's often even less straightforward than that. 
In Excel, for example, the bar height is determined by an invisible 
tangent plane intersecting the corresponding height on the y-axis. 
This gives rise to graphs like the one shown in Figure 2.25. 


FIGURE 2.25 3D column chart (y-axis: Number of issues)


Judging by Figure 2.25, how many issues were there in January and 
February? I've plotted a single issue for each of these months. However,
the way I read the chart, if I compare the bar height to the gridlines
and follow it leftward to the y-axis, I'd estimate visually a value
of maybe 0.8. This is simply bad data visualization. Don't use 3D. 


Secondary y-axis: generally not a good idea 

Sometimes it's useful to be able to plot data that is in entirely different
units against the same x-axis. This often gives rise to the secondary
y-axis: another vertical axis on the right-hand side of the graph.
Consider the example shown in Figure 2.26. 


FIGURE 2.26 Secondary y-axis (series: Revenue, Size of Salesforce)


When interpreting Figure 2.26, it takes some time and reading to 
understand which data should be read against which axis. Because 
of this, you should avoid the use of a secondary or right-hand y-axis. 
Instead, think about whether one of the following approaches will 
meet your needs: 

1. Don't show the second y-axis. Instead, label the data points that 
belong on this axis directly. 

2. Pull the graphs apart vertically and have a separate y-axis for each 
(both along the left) but leverage the same x-axis across both. 


Figure 2.27 illustrates these options. 


FIGURE 2.27 Strategies for avoiding a secondary y-axis (panels: Alternative 1: label directly; Alternative 2: pull apart vertically)


A third potential option not shown here is to link the axis to the data
to be read against it through the use of color. For example, in the 
original graph depicted in Figure 2.26, I could write the left y-axis 
title "Revenue" in blue and keep the revenue bars blue while at the 
same time writing the right y-axis title "# of Sales Employees" in 
orange and making the line graph orange to tie these together visually.
I don't recommend this approach because color can typically
be used more strategically. We'll spend a lot more time discussing 
color in Chapter 4. 

It is also worth noting that when you display two datasets against 
the same axis, it can imply a relationship that may or may not exist. 
This is something to be aware of when determining whether this is 
an appropriate approach in the first place. 

When you're facing a secondary y-axis challenge and considering 
which alternative shown in Figure 2.27 will better meet your needs, 
think about the level of specificity you need. Alternative 1, where each 
data point is labeled explicitly, puts more attention on the specific 
numbers. Alternative 2, where the axes are shown at the left, puts 
more focus on the overarching trends. In general, avoid a secondary
y-axis and instead employ one of these alternate approaches.


In closing

In this chapter, we've explored the types of visual displays I find myself 
using most. There will be use cases for other types of visuals, but what 
we've covered here should meet the majority of everyday needs. 

In many cases, there isn't a single correct visual display; rather, often 
there are different types of visuals that could meet a given need. 
Drawing from the previous chapter on context, most important is 
to have that need clearly articulated: What do you need your audience
to know? Then choose a visual display that will enable you to
make this clear. 

If you're wondering What is the right graph for my situation?, the 
answer is always the same: whatever will be easiest for your audience
to read. There is an easy way to test this, which is to create your
visual and show it to a friend or colleague. Have them articulate the 
following as they process the information: where they focus, what 
they see, what observations they make, what questions they have. 
This will help you assess whether your visual is hitting the mark, or 
in the case where it isn't, help you know where to concentrate your 
changes. 

You now know the second lesson of storytelling with data: how to
choose an appropriate visual display.



chapter three 

Picture a blank page or a blank screen: every single element you
add to that page or screen takes up cognitive load on the part of 
your audience—in other words, takes them brain power to process. 
Therefore, we want to take a discerning look at the visual elements 
that we allow into our communications. In general, identify anything
that isn't adding informative value—or isn't adding enough informative
value to make up for its presence—and remove those things.
Identifying and eliminating such clutter is the focus of this chapter. 


Cognitive load 

You have felt the burden of cognitive load before. Perhaps you were 
sitting in a conference room as the person leading the meeting was 
flipping through their projected slides and they paused on one that 
looked overwhelmingly busy and complicated. Yikes, did you say 
"ugh" out loud, or was that just in your head? Or maybe you were 
reading through a report or the newspaper, and a graph caught 
your eye just long enough for you to think, "this looks interesting 
but I have no idea what I'm meant to get out of it"—and rather than
spend more time to decipher it, you turned the page. 

In both of these instances, what you've experienced is excessive or 
extraneous cognitive load. 

We experience cognitive load anytime we take in information. Cognitive
load can be thought of as the mental effort that's required to
learn new information. When we ask a computer to do work, we are 
relying on the computer's processing power. When we ask our audience
to do work, we are leveraging their mental processing power.
This is cognitive load. Humans' brains have a finite amount of this 
mental processing power. As designers of information, we want to be 
smart about how we use our audience's brain power. The preceding 
examples point to extraneous cognitive load: processing that takes 
up mental resources but doesn't help the audience understand the 
information. This is something we want to avoid. 


The data-ink or signal-to-noise ratio 


A number of concepts have been introduced over time 
in an effort to explain and help provide guidance for 
reducing the cognitive load we push to our audience through 
our visual communications. In his book The Visual Display of 
Quantitative Information, Edward Tufte refers to maximizing 
the data-ink ratio, saying "the larger the share of a graphic's 
ink devoted to data, the better (other relevant matters being 
equal)." This can also be referred to as maximizing the signal-to-noise
ratio (see Nancy Duarte's book Resonate), where the
signal is the information we want to communicate, and the
noise is those elements that either don't add to, or in some
cases detract from, the message we are trying to impart to 
our audience. 

What matters most when it comes to our visual communications is
the perceived cognitive load on the part of our audience: how hard 
they believe they are going to have to work to get the information 
out of your communication. This is a decision they likely reach without
giving it much (if any) conscious thought, and yet it can make the
difference between getting your message across or not. 

In general, think about minimizing the perceived cognitive load (to 
the extent that is reasonable and still allows you to get the information
across) for your audience.


Clutter 

One culprit that can contribute to excessive or extraneous cognitive 
load is something I refer to simply as clutter. These are visual elements
that take up space but don't increase understanding. We'll
take a more specific look at exactly what elements can be considered
clutter soon, but in the meantime I want to talk generally about
why clutter is a bad thing. 

There is a simple reason we should aim to reduce clutter: because it 
makes our visuals appear more complicated than necessary. 

Perhaps without explicitly recognizing it, the presence of clutter in 
our visual communications can cause a less-than-ideal—or worse— 
uncomfortable user experience for our audience (this is that "ugh" 
moment I referred to at the beginning of this chapter). Clutter can 
make something feel more complicated than it actually is. When our 
visuals feel complicated, we run the risk of our audience deciding 
they don't want to take the time to understand what we're showing, 
at which point we've lost our ability to communicate with them. This 
is not a good thing. 


Gestalt principles of visual perception

When it comes to identifying which elements in our visuals are signal 
(the information we want to communicate) and which might be noise 
(clutter), consider the Gestalt Principles of Visual Perception. The
Gestalt School of Psychology set out in the early 1900s to understand
how individuals perceive order in the world around them. What they 
came away with are the principles of visual perception still accepted 
today that define how people interact with and create order out of 
visual stimuli. 

We'll discuss six principles here: proximity, similarity, enclosure, closure,
continuity, and connection. For each, I'll show an example of
the principle applied to a table or graph. 


Proximity

We tend to think of objects that are physically close together as 
belonging to part of a group. The proximity principle is demonstrated
in Figure 3.1: you naturally see the dots as three distinct
groups because of their relative proximity to each other. 




FIGURE 3.1 Gestalt principle of proximity 

We can leverage this way that people see in table design. In Figure 3.2,
simply by virtue of differentiating the spacing between the
dots, your eyes are drawn either down the columns in the first case 
or across the rows in the second case. 


FIGURE 3.2 You see columns and rows, simply due to dot spacing 


Similarity

Objects that are of similar color, shape, size, or orientation are perceived
as related or belonging to part of a group. In Figure 3.3, you
naturally associate the blue circles together on the left or the grey 
squares together on the right. 







FIGURE 3.3 Gestalt principle of similarity 


This can be leveraged in tables to help draw our audience's eyes in 
the direction we want them to focus. In Figure 3.4, the similarity of 
color is a cue for our eyes to read across the rows (rather than down 
the columns). This eliminates the need for additional elements such 
as borders to help direct our attention. 


FIGURE 3.4 You see rows due to similarity of color 


Enclosure

We think of objects that are physically enclosed together as belonging
to part of a group. It doesn't take a very strong enclosure to do
this: light background shading is often enough, as demonstrated in 
Figure 3.5. 



FIGURE 3.5 Gestalt principle of enclosure 


One way we can leverage the enclosure principle is to draw a visual 
distinction within our data, as is done in the graph in Figure 3.6. 



FIGURE 3.6 The shaded area separates the forecast from actual data 


Closure

The closure concept says that people like things to be simple and 
to fit in the constructs that are already in our heads. Because of this, 
people tend to perceive a set of individual elements as a single, recognizable
shape when they can—when parts of a whole are missing,
our eyes fill in the gap. For example, the elements in Figure 3.7
will tend to be perceived as a circle first and only after that as individual
elements.



FIGURE 3.7 Gestalt principle of closure 

It is common for graphing applications (for example, Excel) to 
have default settings that include elements like chart borders and 
background shading. The closure principle tells us that these are 
unnecessary—we can remove them and our graph still appears as a 
cohesive entity. Bonus: when we take away those unnecessary elements,
our data stands out more, as shown in Figure 3.8.




FIGURE 3.8 The graph still appears complete without the border and background shading


Continuity

The principle of continuity is similar to closure: when looking at 
objects, our eyes seek the smoothest path and naturally create 
continuity in what we see even where it may not explicitly exist. By 
way of example, in Figure 3.9, if I take the objects (1) and pull them 
apart, most people will expect to see what is shown next (2), whereas 
it could as easily be what is shown after that (3). 


FIGURE 3.9 Gestalt principle of continuity (panels: 1, 2, 3)


In the application of this principle, I've removed the vertical y-axis 
line from the graph in Figure 3.10 altogether. Your eyes actually still 
see that the bars are lined up at the same point because of the consistent
white space (the smoothest path) between the labels on the
left and the data on the right. As we saw with the closure principle 
in application, stripping away unnecessary elements allows our data 
to stand out more. 


FIGURE 3.10 Graph with y-axis line removed (categories: A, B, C, D, E)


Connection

The final Gestalt principle we'll focus on is connection. We tend to 
think of objects that are physically connected as part of a group. The 
connective property typically has a stronger associative value than 
similar color, size, or shape. Note when looking at Figure 3.11, your 
eyes probably pair the shapes connected by lines (rather than similar
color, size, or shape): that's the connection principle in action. The 
connective property isn't typically stronger than enclosure, but you 
can impact this relationship through thickness and darkness of lines 
to create the desired visual hierarchy (we'll talk more about visual 
hierarchy when we discuss preattentive attributes in Chapter 4). 



FIGURE 3.11 Gestalt principle of connection 

One way that we frequently leverage the connection principle is 
in line graphs, to help our eyes see order in the data, as shown in 
Figure 3.12. 


FIGURE 3.12 Lines connect the dots 



As you have learned from this brief overview, the Gestalt principles 
help us understand how people see, which we can use to identify 
unnecessary elements and ease the processing of our visual 
communications. We aren't done with them yet. At the end of this 
chapter, we'll discuss how we can apply some of these principles to 
a real-world example. 
But first, let's shift our focus to a couple of other types of visual clutter.


Lack of visual order 

When design is thoughtful, it fades into the background so that 
your audience doesn't even notice it. When it's not, however, your 
audience feels the burden. Let's look at an example to understand 
the impact visual order—and lack thereof—can have on our visual 
communications. 

Take a moment to study Figure 3.13, which summarizes survey feedback
about factors considered by nonprofits in vendor selection.
Note specifically any observations you may have regarding the 
arrangement of elements on the page. 


Demonstrating effectiveness is most important consideration when 
selecting a provider 


In general, what attributes are the most important 
to you in selecting a service provider? 
(Choose up to 3) 


Demonstration of results 
Content expertise 
Local knowledge 
National reputation 
Affordability of services 
Previous work together 


Colleague recommendation 



Survey shows that 
demonstration of results is 
the single most important 
dimension when choosing a 
service provider. 


Affordability and experience 
working together previously, 
which were hypothesized to 
be very important in the 
decision making process, 
were both cited less 
frequently as important
attributes. 


% selecting given attribute 


Data source: xyz; includes N number of survey respondents. Note that 
respondents were able to choose up to 3 options. 


FIGURE 3.13 Summary of survey feedback 


As you look over the information, you might be thinking, "this looks 
pretty good." I'll concede: it's not horrible. On the positive side, the 
takeaway is clearly outlined, the graph is well ordered and labeled, 
and key observations are articulated and tied visually to where we're 
meant to look in the graph. But when it comes to the overall design 
of the page and placement of elements, I'd have to disagree with any 
praise. To me, the aggregate visual feels disorganized and uncomfortable
to look at, as if the various components were haphazardly
put there without regard for the structure of the overall page. 

We can improve this visual markedly by making some relatively minor 
changes. Take a look at Figure 3.14. The content is exactly the same; 
only the placement and formatting of elements have been modified. 


Demonstrating effectiveness is most important consideration 
when selecting a provider 


In general, what attributes are the most important 

to you in selecting a service provider? 


(Choose up to 3) % selecting given attribute 

0% 20% 40% 60% 80% 


Demonstration of results 


Content expertise 


Local knowledge 


Survey shows that demonstration 
of results is the single most 
important dimension when 
choosing a service provider. 


National reputation 
Affordability of services 
Previous work together 
Colleague recommendation 


Affordability and experience 
working together previously, 

which were hypothesized to be 
very important in the decision 
making process, were both cited 
less frequently as important attributes. 


Data source: xyz; includes N number of survey respondents. 
Note that respondents were able to choose up to 3 options. 


FIGURE 3.14 Revamped summary of survey feedback 


Compared to the original visual, the second iteration feels somehow 
easier. There is order. It is evident that conscious thought was paid 
to the overarching design and arrangement of components. Specifically,
the latter version has been designed with greater attention
to alignment and white space. Let's look at each of these in detail. 


Alignment 

The single change having the biggest impact in the preceding 
before-and-after example was the shift from center-aligned to
left-justified text. In the original version, each block of text on the page
is center-aligned. This does not create clean lines either on the left 
or on the right, which can make even a thoughtful layout appear 
sloppy. I tend to avoid center-aligned text for this reason. The decision
of whether to left- or right-justify your text should be made in the
context of the other elements on the page. In general, the goal is 
to create clean lines (both horizontally and vertically) of elements 
and white space. 


Presentation software tips for aligning elements 


To help ensure that your elements line up when you are
placing them on a page within your presentation software,
turn on the rulers or gridlines that are built into most
programs. This will allow you to precisely align your elements 
to create a cleaner look and feel. The table functionality built 
into most presentation applications can also be used as a 
makeshift brute-force method: create a table to give yourself 
guidelines for the placement of discrete elements. When you 
have everything lined up exactly like you want it, remove the 
table or make the table's borders invisible so that all that is 
left is your perfectly arranged page. 


Without other visual cues, your audience will typically start at the top 
left of the page or screen and will move their eyes in a "z" shape (or 
multiple "z" shapes, depending on the layout) across the page or 
screen as they take in information. Because of this, when it comes to 
tables and graphs, I like to upper-left-most justify the text (title, axis 
titles, legend). This means the audience will hit the details that tell 
them how to read the table or graph before they get to the data itself. 

As part of our discussion on alignment, let's spend a bit of time on 
diagonal components. In the previous example, the original version 
(Figure 3.13) had diagonal lines connecting the takeaways to the data 
and diagonally oriented x-axis labels; the former were removed and 
the latter changed to horizontal orientation in the makeover (Figure 
3.14). Generally, diagonal elements such as lines and text should be 
avoided. They look messy and, in the case of text, are harder to read
than their horizontal counterparts. When it comes to the orientation
of text, one study (Wigdor & Balakrishnan, 2005) found that
reading text rotated 45 degrees in either direction was, on average,
52% slower than reading normally oriented text (text rotated
90 degrees in either direction was 205% slower on average). It is best
to avoid diagonal elements on the page.
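
To get a feel for what those percentages mean in practice, here is a minimal sketch that turns them into reading times. I am reading "X% slower" as "takes X% more time," and the 10-second baseline is a made-up number for illustration, not a figure from the study.

```python
# Slowdowns quoted in the text (Wigdor & Balakrishnan, 2005), expressed as
# the extra fraction of time relative to normally oriented text.
SLOWDOWN = {0: 0.00, 45: 0.52, 90: 2.05}


def reading_time(baseline_seconds: float, rotation_degrees: int) -> float:
    """Estimated time to read text rotated by the given angle."""
    return baseline_seconds * (1 + SLOWDOWN[rotation_degrees])


# Hypothetical baseline: a label that takes 10 seconds to read when horizontal.
for angle in (0, 45, 90):
    print(f"{angle:>2} degrees: {reading_time(10.0, angle):.1f} s")
```

Under these assumptions, a label that takes 10 seconds to read horizontally would take roughly 15 seconds at 45 degrees and roughly 30 seconds at 90 degrees.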


White space 

I've never quite understood this phenomenon, but for some reason, 
people tend to fear white space on a page. I use "white space" to 
refer to blank space on the page. If your pages are blue, for example,
this would be "blue space" (I'm not sure why they would be
blue, but the use of color is a conversation we will have later).
Perhaps you've heard this feedback before: "there is still some space
left on that page, so let's add something there," or worse, "there
is still some space left on that page, so let's add more data." No!
Never add data just for the sake of adding data; only add data with
a thoughtful and specific purpose in mind!

We need to get more comfortable with white space. 

White space in visual communication is as important as pauses in 
public speaking. Perhaps you have sat through a presentation that 
lacked pauses. It feels something like this: there is a speaker up in 
front of you and possibly due to nerves or perhaps because they're 
trying to get through more material than they should in the allotted
time they are speaking a mile a minute and you're wondering
how they're even able to breathe you'd like to ask a question but 
the speaker has already moved on to the next topic and still hasn't 
paused long enough for you to be able to raise your question. This 
is an uncomfortable experience for the audience, similar to the discomfort
you may have felt reading through the preceding run-on,
unpunctuated sentence. 
Now imagine the effect if that same presenter were to make a single
bold statement: "Death to pie charts!" 

And then pause for a full 15 seconds to let that statement resonate. 


Go ahead—say it out loud and then count to 15 slowly. 


That's a dramatic pause. 


And it got your attention, didn't it? 

That is the same powerful effect that white space used strategically 
can have on our visual communications. The lack of it—like the lack 
of pauses in a spoken presentation—is simply uncomfortable for 
our audience. Audience discomfort in response to the design of our 
visual communications is something we should aim to avoid. White 
space can be used strategically to draw attention to the parts of the
page that are not white space. 

When it comes to preserving white space, here are some minimal 
guidelines. Margins should remain free of text and visuals. Resist the 
urge to stretch visuals to take up the available space; instead, appropriately
size your visuals to their content. Beyond these guidelines,
think about how you can use white space strategically for emphasis,
as was illustrated with the dramatic pause earlier. If there is one
thing that is really important, think about making that the only thing
on the page. In some cases, this could be a single sentence or even
a single number. We'll talk further about using white space strategically
and look at an example when we discuss aesthetics in Chapter 5.


Non-strategic use of contrast 

Clear contrast can be a signal to our audience, helping them understand
where to focus their attention. We will explore this
idea in greater detail in later chapters. The lack of clear contrast,
on the other hand, can be a form of visual clutter. When discussing
the critical value of contrast, there is an analogy I often borrow from
Colin Ware (Information Visualization: Perception for Design, 2004),
who said it's easy to spot a hawk in a sky full of pigeons, but as the
variety of birds increases, that hawk becomes harder and harder to
pick out. This highlights the importance of the strategic use of contrast
in visual design: the more things we make different, the lesser
the degree to which any of them stand out. To explain this another
way, if there is something really important we want our audience to 
know or see (the hawk), we should make that the one thing that is 
very different from the rest. 

Let's look at an example to further illustrate this concept. 

Imagine you work for a U.S. retailer and want to understand how 
your customers feel about various dimensions of their shopping 
experience in your store compared to your competitors. You have 
conducted a survey to collect this information and are now trying 
to understand what it tells you. You have created a weighted performance
index to summarize each category of interest (the higher
the index, the better the performance, and vice versa). Figure 3.15 
shows the weighted performance index across categories for your 
company and five competitors. 

Study it for a moment and make note of your thought process as 
you take in the information. 


[Scatterplot. Title: Weighted Performance Index. Y-axis from (1.50) to 1.50.
X-axis categories: Selection, Convenience, Service, Relationship, Price.
Legend: Our Business, Competitor A, Competitor B, Competitor C,
Competitor D, Competitor E.]

FIGURE 3.15 Original graph 


If you had to describe Figure 3.15 in a single word, what would that 
word be? Words like busy, confusing, and perhaps exhausting come 
to mind. There is a lot going on in this graph. So many things are 
competing for our attention that it is hard to know where to look. 

Let's review exactly what we're looking at. As I mentioned, the data 
graphed is a weighted performance index. You don't need to worry 
about the details of how this is calculated, but rather understand 
that this is a summary performance metric that we'd like to compare
across various categories (shown across the horizontal x-axis:
Selection, Convenience, Service, Relationship, and Price) for "Our
Business" (depicted by the blue diamond) compared to a number
of competitors (the other colored shapes). A higher index represents
better performance, and a lower index means lower performance. 

Taking in this information is a slow process, with a lot of back and 
forth between the legend at the bottom and the data in the graph 
to decipher what is being conveyed. Even if we are very patient and 
really want to get information out of this visual, it is nearly impossible 
because "Our Business" (the blue diamond) is sometimes obscured 
by other data points, making it so we can't even see the comparison 
that is most important to make! 

This is a case where lack of contrast (as well as some other design 
issues) makes the information much harder to interpret than it need 
be. 

Consider Figure 3.16, where we use contrast more strategically. 

[Horizontal bar chart. Title: Performance overview. Legend: Our business,
Competitor A, Competitor B, Competitor C, Competitor D, Competitor E.
Visible category labels: Relationship, Service, Selection.]




FIGURE 3.16 Revamped graph, using contrast strategically 
In the revised graph, I've made a number of changes. First, I chose a
horizontal bar chart to depict the information. In doing so, I rescaled
all the numbers to be on a positive scale; in the original scatterplot,
there were some negative values that complicated the visualization
challenge. This change works here since we're more interested in relative
differences than absolute values. In this remake, the categories
that were previously along the horizontal x-axis now run down the
vertical y-axis. Within each category, the length of the bar shows the
summary metric across "Our business" (blue) and the various competitors
(grey), with longer bars representing better performance.
The decision not to show the actual x-axis scale in this case was a
deliberate one, which forces the audience to focus on relative differences
rather than get caught up in the minutiae of the specific
numbers.
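
If you build this kind of chart in code, the two ideas above (a positive scale and a single accent color) can be sketched roughly as follows. This is only an illustration: the text does not say how the rescaling was done, so the shift-by-minimum approach, the `floor` parameter, the color names, and the index values are all my assumptions.

```python
def rescale_positive(values, floor=0.25):
    """Shift values so the smallest equals `floor`; differences are preserved."""
    lowest = min(values)
    return [v - lowest + floor for v in values]


def bar_colors(names, highlight, accent="tab:blue", muted="lightgrey"):
    """One accent color for the series we care about, grey for everything else."""
    return [accent if name == highlight else muted for name in names]


# Hypothetical index values for one category (not the data behind the figure).
names = ["Our business", "Competitor A", "Competitor B", "Competitor C"]
index = [-0.50, 1.00, 0.25, -1.25]

print(rescale_positive(index))            # every value is now positive
print(bar_colors(names, "Our business"))  # only "Our business" gets the accent
```

Because every value is shifted by the same amount, the gaps between bars are unchanged, which is what matters when the audience is comparing relative performance.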

With this design, it is easy to see two things quickly: 

1. We can let our eyes scan across the blue bars to get a relative
sense of how "Our business" is doing across the various categories:
we score high on Price and Convenience and lower on
Relationship, possibly because we're struggling when it comes to
Service and Selection, as evidenced by low scores in these areas.

2. Within a given category, we can compare the blue bar to the
grey bars to see how our business is faring relative to competitors:
winning compared to the competition on Price, losing on
Service and Selection.

Competitors are distinguished from each other based on the order 
in which they appear (Competitor A always appears directly after the 
blue bar, Competitor B after that, and so on), which is outlined in 
the legend at the left. If it were important to be able to quickly identify
each competitor, this design doesn't immediately allow for that.
But if that is a second- or third-order comparison in terms of priority 
and isn't the most critical thing, this approach can work well. In the 
makeover, I've also organized the categories in order of decreasing 
weighted performance index for "Our business," which provides a 
construct for our audience to use as they take in the information, 
and added a summary metric (relative rank) so it's easy to know
quickly how "Our business" ranks in each category in relation to 
our competition. 

Note here how the effective use of contrast (and some other thoughtful
design choices) makes it a much faster, easier, and just more
comfortable-feeling process to get the information we're after than it
was in the original graph. 


When redundant details shouldn't be considered 
clutter 


I've seen cases where the title of the visual indicates the values
are dollars but the dollar signs aren't included with the
actual numbers in the table or graph. For example, a graph
titled "Monthly Sales ($USD Millions)" with y-axis labels of
10, 20, 30, 40, 50. I find this confusing. Including the "$"
sign with each number eases the interpretation of the figures.
Your audience doesn't have to remember they are looking
at dollars because they are labeled explicitly. There are
some elements that should always be retained with numbers, 
including dollar signs, percent signs, and commas in large 
numbers. 
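
If your labels are generated in code rather than typed by hand, a couple of small formatting helpers can keep those signs and commas attached to the numbers. The sketch below uses only Python's built-in string formatting; the function names are my own.

```python
def dollars(value: float) -> str:
    """Format a number with a dollar sign and thousands separators."""
    return f"${value:,.0f}"


def percent(fraction: float) -> str:
    """Format a fraction (0.4) as a whole-number percentage (40%)."""
    return f"{fraction:.0%}"


# The y-axis labels from the example above, now labeled explicitly.
print([dollars(v) for v in (10, 20, 30, 40, 50)])
print(dollars(1250000))
print(percent(0.4))
```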


Decluttering: step-by-step 

Now that we have discussed what clutter is, why it is important to 
eliminate it from our visual communications, and how to recognize 
it, let's look at a real-world example and examine how the process 
of identifying and removing clutter improves our visual and the clarity
of the story that we're ultimately trying to tell.

Scenario: Imagine that you manage an information technology (IT) 
team. Your team receives tickets, or technical issues, from employees. 
In the past year, you've had a couple of people leave and decided 
at the time not to replace them. You have heard a rumbling of complaints
from the remaining employees about having to "pick up the
slack." You've just been asked about your hiring needs for the coming
year and are wondering if you should hire a couple more people.
First, you want to understand what impact the departure of individuals
over the past year has had on your team's overall productivity.
You plot the monthly trend of incoming tickets and those processed
over the past calendar year. You see that there is some evidence your
team's productivity is suffering from being short-staffed and now 
want to turn the quick-and-dirty visual you created into the basis for 
your hiring request. 

Figure 3.17 shows your original graph. 



FIGURE 3.17 Original graph 


Take another look at this visual with an eye toward clutter. Consider 
the lessons we've covered on Gestalt principles, alignment, white 
space, and contrast. What things can we get rid of or change? How 
many issues can you identify? 

I identified six major changes to reduce clutter. Let's discuss each. 
1. Remove chart border

Chart borders are usually unnecessary, as we covered in our discussion
of the Gestalt principle of closure. Instead, think about using
white space to differentiate the visual from other elements on the 
page as needed. 



FIGURE 3.18 Remove chart border 
2. Remove gridlines

If you think it will be helpful for your audience to trace their finger 
from the data to the axis, or you feel that your data will be more 
effectively processed, you can leave the gridlines. But make them 
thin and use a light color like grey. Do not let them compete visually 
with your data. When you can, get rid of them altogether: this allows 
for greater contrast, and your data will stand out more. 



FIGURE 3.19 Remove gridlines 
3. Remove data markers

Remember, every single element adds cognitive load on the part of 
your audience. Here, we're adding cognitive load to process data 
that is already depicted visually with the lines. This isn't to say that 
you should never use data markers, but rather use them on purpose 
and with a purpose, rather than because their inclusion is your graphing
application's default.


FIGURE 3.20 Remove data markers
4. Clean up axis labels

One of my biggest pet peeves is trailing zeros on y-axis labels: they 
carry no informative value, and yet make the numbers look more 
complicated than they are! Get rid of them, reducing their unnecessary
burden on the audience's cognitive load. We can also abbreviate
the months of the year so that they will fit horizontally on the
x-axis, eliminating the diagonal text. 



FIGURE 3.21 Clean up axis labels 
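
If you generate axis labels programmatically, the cleanup in this step might look something like the sketch below. The helper names are hypothetical, and the month abbreviations come from Python's standard `calendar` module, which returns English names under the default locale.

```python
import calendar


def tidy_number(value: float) -> str:
    """Drop trailing zeros: 300.00 becomes '300', 2.50 becomes '2.5'."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def short_month(month_number: int) -> str:
    """Three-letter month abbreviation, e.g. 1 becomes 'Jan' (default locale)."""
    return calendar.month_abbr[month_number]


print([tidy_number(v) for v in (0.00, 50.00, 100.00, 2.50)])
print([short_month(m) for m in range(1, 13)])
```

Short labels like these are what let the month names sit horizontally on the x-axis instead of being rotated.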
5. Label data directly

Now that we have eliminated much of the extraneous cognitive load,
the work of going back and forth between the legend and the data
is even more evident. Remember, we want to try to identify anything
that will feel like effort to our audience and take that work upon ourselves
as the designers of the information. In this case, we can leverage
the Gestalt principle of proximity and put the data labels right
next to the data they describe. 



FIGURE 3.22 Label data directly 
6. Leverage consistent color

While we leveraged the Gestalt principle of proximity in the prior
step, let's also think about leveraging the Gestalt principle of similarity
and make the data labels the same color as the data they
describe. This is another visual cue to our audience that says, "these 
two pieces of information are related." 

FIGURE 3.23 Leverage consistent color
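
For readers who build their graphs in code, the six steps can be sketched in one place. The example below assumes matplotlib as the plotting library (my choice, not something this walkthrough depends on), and the ticket counts, colors, and function names are all hypothetical. Treat it as one possible translation of the steps rather than the way the figures here were made.

```python
def direct_labels(series, colors):
    """Steps 5 and 6: one label per series, placed at the line's last point
    and given the same color as the line it describes."""
    return [
        {"text": name, "x": len(values) - 1, "y": values[-1], "color": colors[name]}
        for name, values in series.items()
    ]


def draw_decluttered(series, colors, month_labels):
    """Draw a line graph with the six decluttering steps applied."""
    import matplotlib.pyplot as plt  # local import: only needed when drawing

    fig, ax = plt.subplots()
    for name, values in series.items():
        # Step 3: plain lines, no data markers.
        ax.plot(range(len(values)), values, color=colors[name], marker="")
    # Step 1: remove the border (top and right edges of the plot area).
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    # Step 2: remove gridlines.
    ax.grid(False)
    # Step 4: short month labels, kept horizontal.
    ax.set_xticks(range(len(month_labels)))
    ax.set_xticklabels(month_labels, rotation=0)
    # Steps 5 and 6: label each line directly, in the line's own color.
    for label in direct_labels(series, colors):
        ax.annotate(label["text"], xy=(label["x"], label["y"]),
                    xytext=(5, 0), textcoords="offset points",
                    color=label["color"], va="center")
    return fig


# Hypothetical ticket counts (not the data behind Figure 3.17).
series = {"Received": [160, 184, 241, 149], "Processed": [160, 184, 237, 148]}
colors = {"Received": "grey", "Processed": "tab:blue"}
print(direct_labels(series, colors))
```

Calling `draw_decluttered(series, colors, ["Jan", "Feb", "Mar", "Apr"])` returns a figure you can save or display. No legend is ever created, because the direct labels make one unnecessary.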
This visual is not yet complete. But identifying and eliminating the
clutter has brought us a long way in terms of reducing cognitive 
load and improving accessibility. Take a look at the before-and-after 
shown in Figure 3.24. 




FIGURE 3.24 Before-and-after 

In closing 

Anytime you put information in front of your audience, you are creating
cognitive load and asking them to use their brain power to process
that information. Visual clutter creates excessive cognitive load
that can hinder the transmission of our message. The Gestalt Principles
of Visual Perception can help you understand how your audience
sees and allow you to identify and remove unnecessary visual
elements. Leverage alignment of elements and maintain white space 
to help make the interpretation of your visuals a more comfortable 
experience for your audience. Use contrast strategically. Clutter is 
your enemy: ban it from your visuals! 

You now know how to identify and eliminate clutter. 















chapter four 


focus your audience's attention


In the previous chapter, we learned about clutter and the importance 
of identifying and removing it from our visuals. While we work to eliminate
distractions, we also want to look at what remains and consider
how we want our audience to interact with our visual communications.

In this chapter, we further examine how people see and how you can 
use that to your advantage when crafting visuals. We will talk briefly 
about sight and memory in order to highlight the importance of some
specific, powerful tools: preattentive attributes. We will explore how 
preattentive attributes like size, color, and position on page can be 
used strategically in two ways. First, preattentive attributes can be 
leveraged to help direct your audience's attention to where you want 
them to focus it. Second, they can be used to create a visual hierarchy 
of elements to lead your audience through the information you want 
to communicate in the way you want them to process it. 
By understanding how our audience sees and processes information,
we put ourselves in a better position to be able to communicate
effectively.


You see with your brain 

Let's look at a simplified picture of how people see, depicted in 
Figure 4.1. The process goes something like this: light reflects off of 
a stimulus. This gets captured by our eyes. We don't fully see with 
our eyes; there is some processing that happens there, but mostly 
it is what happens in our brain that we think of as visual perception. 



FIGURE 4.1 A simplified picture of how you see 


A brief lesson on memory 

Within the brain, there are three types of memory that are important 
to understand as we design visual communications: iconic memory, 
short-term memory, and long-term memory. Each plays an important
and distinct role. What follows are basic explanations of highly
complex processes, covered simply to set the stage for what you 
need to know when designing visual communications. 






Iconic memory

Iconic memory is super fast. It happens without you consciously realizing
it and is piqued when we look at the world around us. Why?
Long ago in the evolutionary chain, predators helped our brains 
develop in ways that allowed for great efficiency of sight and speed 
of response. In particular, the ability to quickly pick up differences 
in our environment (for example, the motion of a predator in the
distance) became ingrained in our visual process. These were survival
mechanisms then; they can be leveraged for effective visual
communication today. 

Information stays in your iconic memory for a fraction of a second 
before it gets forwarded on to your short-term memory. The important
thing about iconic memory is that it is tuned to a set of preattentive
attributes. Preattentive attributes are critical tools in your
visual design tool belt, so we'll come back to those in a moment. In 
the meantime, let's continue our discussion on memory. 


Short-term memory 

Short-term memory has limitations. Specifically, people can keep 
about four chunks of visual information in their short-term memory 
at a given time. This means that if we create a graph with ten different
data series that are ten different colors with ten different shapes
of data markers and a legend off to the side, we're making our audience
work very hard going back and forth between the legend and
the data to decipher what they are looking at. As we've discussed
previously, to the extent possible, we want to limit this sort of cognitive
burden on our audience. We don't want to make our audience
work to get at the information, because in doing so, we run the risk of 
losing their attention. With that, we lose our ability to communicate. 

In this specific situation, one solution is to label the various data 
series directly (reducing that work of going back and forth between 
the legend and the data by leveraging the Gestalt principle of proximity
that we covered in Chapter 3). More generally, we want to form
larger, coherent chunks of information so that we can fit them into
the finite space in our audience's working memory. 


Long-term memory 

When something leaves short-term memory, it either goes into oblivion
and is likely lost forever, or is passed into long-term memory.
Long-term memory is built up over a lifetime and is vitally important
for pattern recognition and general cognitive processing. Long-term
memory is the aggregate of visual and verbal memory, which
act differently. Verbal memory is accessed by a neural net, where the 
path becomes important for being able to recognize or recall. Visual 
memory, on the other hand, functions with specialized structures. 

There are aspects of long-term memory that we want to make use 
of when it comes to having our message stick with our audience. Of 
particular importance to our conversation is that images can help us 
more quickly recall things stored in our long-term verbal memory. For 
example, if you see a picture of the Eiffel Tower, a flood of concepts 
you know about, feelings you have toward, or experiences you've 
had in Paris may be triggered. By combining the visual and verbal, 
we set ourselves up for success when it comes to triggering the formation
of long-term memories in our audience. We'll discuss some
specific tactics for this in Chapter 7 in the context of storytelling. 


Preattentive attributes signal where to look 

In the previous section, I introduced iconic memory and mentioned 
that it is tuned to preattentive attributes. The best way to prove the 
power of preattentive attributes is to demonstrate it. Figure 4.2 shows 
a block of numbers. Taking note of how you process the information 
and how long it takes, quickly count the number of 3s that appear 
in the sequence. 


756395068473

658663037576 

860372658602 

846589107830 

FIGURE 4.2 Count the 3s example 

The correct answer is six. In Figure 4.2, there were no visual cues to 
help you reach this conclusion. This makes for a challenging exercise,
during which you have to hunt through four lines of text, looking
for the number 3 (a kind of complicated shape).


Check out what happens when we make a single change to the block
of numbers. Turn the page and repeat the exercise of counting the 
3s using Figure 4.3. 
756395068473
658663037576
860372658602
846589107830

FIGURE 4.3 Count the 3s example with preattentive attributes 

Note how much easier and faster the same exercise is using Figure 
4.3. You don't have time to blink, don't really have time to think, and 
suddenly there are six 3s in front of you. This is so apparent so quickly 
because in this second iteration, your iconic memory is being leveraged.
The preattentive attribute of intensity of color, in this case,
makes the 3s the one thing that stands out as distinct from the rest. 
Our brain is quick to pick up on this without our having to dedicate 
any conscious thought to it. 
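
You can recreate a version of this exercise yourself. The sketch below counts the 3s in the same block of numbers and then prints the block with each 3 in bold. Bold is my stand-in for the intensity of color used in Figure 4.3, and the escape codes assume a terminal that supports ANSI formatting.

```python
BLOCK = [
    "756395068473",
    "658663037576",
    "860372658602",
    "846589107830",
]

# ANSI escape codes for bold; these only render in terminals that support them.
BOLD, RESET = "\033[1m", "\033[0m"


def count_digit(lines, digit="3"):
    """Total occurrences of one digit across all lines."""
    return sum(line.count(digit) for line in lines)


def emphasize(line, digit="3"):
    """Wrap every occurrence of the digit in bold so it stands out."""
    return line.replace(digit, f"{BOLD}{digit}{RESET}")


for line in BLOCK:
    print(emphasize(line))
print("Number of 3s:", count_digit(BLOCK))
```

The count comes out to six, matching the answer given earlier. The difference is that the program has to check every character, whereas your eyes pick out the emphasized digits almost instantly.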

This is remarkable. And profoundly powerful. It means that, if we 
use preattentive attributes strategically, they can help us enable our 
audience to see what we want them to see before they even 
know they're seeing it! 

Note the multiple preattentive attributes I've used in the preceding 
text to underscore its importance! 


Figure 4.4 shows the various preattentive attributes. 


[Grid of small panels, one per preattentive attribute: Orientation, Shape,
Line length, Line width, Size, Curvature, Added marks, Enclosure, Hue,
Intensity, Spatial position, Motion]

FIGURE 4.4 Preattentive attributes 

Source: Adapted from Stephen Few's Show Me the Numbers, 2004. 


Note as you scan across the attributes in Figure 4.4, your eye is drawn
to the one element within each group that is different from the rest:
you don't have to look for it. That's because our brains are hardwired
to quickly pick up differences we see in our environment. 

One thing to be aware of is that people tend to associate quantitative
values with some (but not all) of the preattentive attributes.
For example, most people will consider a long line to represent a 
greater value than a short line. That is one of the reasons bar charts 
are straightforward for us to read. But we don't think of color in the 
same way. If I ask you which is greater—red or blue?—this isn't a 
meaningful question. This is important because it tells us which of 
the attributes can be used to encode quantitative information (line 
length, spatial position, or to a more limited extent, line width, size, 
and intensity can be used to reflect relative value), and which should 
be used as categorical differentiators. 
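
One way to keep this distinction handy is a small lookup. The sketch below simply restates the grouping from the paragraph above in code; the set and function names are my own, and anything not named as quantitative is treated as a categorical differentiator.

```python
# Groupings taken from the paragraph above.
QUANTITATIVE = {"line length", "spatial position"}
QUANTITATIVE_LIMITED = {"line width", "size", "intensity"}


def encoding_role(attribute: str) -> str:
    """Suggest how a preattentive attribute is best used in a visual."""
    name = attribute.lower()
    if name in QUANTITATIVE:
        return "quantitative"
    if name in QUANTITATIVE_LIMITED:
        return "quantitative (limited)"
    return "categorical"


for attribute in ("Line length", "Size", "Hue", "Shape"):
    print(f"{attribute}: {encoding_role(attribute)}")
```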

When used sparingly, preattentive attributes can be extremely useful 
for doing two things: (1) drawing your audience's attention quickly
to where you want them to look, and (2) creating a visual hierarchy
of information. Let's look at examples of each of these, first with text 
and then in the context of data visualization. 


Preattentive attributes in text 

Without any visual cues, when we're confronted with a block of text, 
our only option is to read it. But preattentive attributes employed 
sparingly can quickly change this. Figure 4.5 shows how you can utilize
some of the preattentive attributes introduced previously with
text. The first block of text doesn't employ any preattentive attributes.
This renders it similar to the count the 3s example: you have
to read it, put on the lens of what's important or interesting, then
possibly read it again to put the interesting parts back into the context
of the rest.

Observe how leveraging preattentive attributes changes the way 
you process the information. The subsequent blocks of text employ 
a single preattentive attribute each. Note how, within each, the preattentive
attribute grabs your attention, and how some attributes
draw your eyes with greater or weaker force than others (for example,
color and size are attention grabbing, whereas italics achieve a
milder emphasis).


Preattentive attributes in text

No preattentive attributes

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
account manager even called to check in after 
normal business hours. 

You have a great company - keep up the good work! 

Color 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
account manager even called to check in after 
normal business hours. 

You have a great company - keep up the good work! 


Size 

What are we doing well? Great Products. These 
products are the best in their class. Replacement 
parts are shipped when needed. You sent gaskets
without me having to ask. Problems are resolved promptly. Bev in the
billing office was quick to resolve a billing issue I 
had. General customer service exceeds 
expectations. The account manager even called to 
check in after normal business hours. You have a 
great company - keep up the good work! 


Outline (enclosure) 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
[account manager even called to check in] after
normal business hours. 

You have a great company - keep up the good work! 


Bold 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
account manager even called to check in after 
normal business hours. 

You have a great company - keep up the good work! 

Italics 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
account manager even called to check in after 
normal business hours. 

You have a great company - keep up the good work! 


Separate spatially 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. 

Problems are resolved promptly. 

Bev in the billing office was quick to resolve a billing 
issue I had. General customer service exceeds 
expectations. The account manager even called to 
check in after normal business hours. You have a 
great company - keep up the good work! 


Underline (added marks) 

What are we doing well? Great Products. These 
products are clearly the best in their class. 
Replacement parts are shipped when needed. You 
sent me gaskets without me having to ask. Problems 
are resolved promptly. Bev in the billing office was 
quick to resolve a billing issue I had. General 
customer service exceeds expectations. The 
account manager even called to check in after 
normal business hours. 

You have a great company - keep up the good work! 


FIGURE 4.5 Preattentive attributes in text

Beyond drawing our audience's attention to where we want them
to focus it, we can employ preattentive attributes to create visual
hierarchy in our communications. As we saw in Figure 4.5, the various
attributes draw our attention with differing strength. In addition,
there are variances within a given preattentive attribute that will draw
attention with more or less strength. For example, with the preattentive
attribute of color, a bright blue will typically draw attention
more than a muted blue. Both will draw more attention than a light 
grey. We can leverage this variance and use multiple preattentive 
attributes together to make our visuals scannable, by emphasizing 
some components and de-emphasizing others. 

Figure 4.6 illustrates how this can be done with the block of text from 
the previous example. 

What are we doing well? 

Themes & example comments 

• Great products: "These products are clearly the best in class." 

• Replacement parts are shipped when needed: 

"You sent me gaskets without me having to ask, and I really 
needed them, too!" 

• Problems are resolved promptly: "Bev in the billing office was 
quick to resolve a billing issue I had." 

• General customer service exceeds expectations: 

"The account manager even called after normal business hours. 

You have a great company - keep up the good work!" 

FIGURE 4.6 Preattentive attributes can help create a visual hierarchy of 
information 

Preattentive attributes have been used in Figure 4.6 to create a 
visual hierarchy of information. This makes the information we present
more easily scannable. Studies have shown that we have about
3-8 seconds with our audience, during which time they decide
whether to continue to look at what we've put in front of them or
direct their attention to something else. If we've used our preattentive
attributes wisely, even if we only get that initial 3-8 seconds, we've
given our audience the gist of what we want to say.

Leveraging preattentive attributes to create a clear visual hierarchy 
of information establishes implicit instructions for your audience, 
indicating to them how to process the information. We can signal 
what is most important that they should pay attention to first, what 
is second most important that they should pay attention to next, and 
so on. We can push necessary but non-message-impacting components
to the background so they don't compete for attention. This
makes it both easier and faster for our audience to take in the information
that we provide.

The preceding example demonstrated the use of preattentive attributes
in text. Preattentive attributes are also very useful for communicating
effectively with data.


Preattentive attributes in graphs 

Graphs, without other visual cues, can become very much like the 
count the 3s exercise or the block of text we've considered previously.
Take the following example. Imagine you work for a car manufacturer.
You are interested in understanding and sharing insight
about the top design concerns (measured as the number of concerns
per 1,000 concerns) from customers for a particular vehicle make
and model. Your initial visual might look something like Figure 4.7.

Top 10 design concerns

concerns per 1,000 


Engine power is less than expected 
Tires make excessive noise while driving 
Engine makes abnormal/excessive noise 
Seat material concerns 
Excessive wind noise 
Hesitation or delay when shifting 
Bluetooth system has poor sound quality 
Steering system/wheel has too much play 
Bluetooth system is difficult to use 
Front seat audio/entertainment/navigation controls 




FIGURE 4.7 Original graph, no preattentive attributes 


Note how, without other visual cues, you are left to process all of the 
information. With no clues about what's important or should be paid 
attention to, it's the count the 3s exercise all over again. 

Recall the distinction that was drawn early on in Chapter 1 between 
exploratory and explanatory analysis. The visual in Figure 4.7 could 
be one you create during the exploratory phase: when you're looking 
at the data to understand what might be interesting or noteworthy 
to communicate to someone else. Figure 4.7 shows us that there are 
ten design concerns that have more than eight concerns per 1,000. 

When it comes to explanatory analysis and leveraging this visual to 
share information with your audience (rather than just showing data), 
thoughtful use of color and text is one way we can focus the story, 
as illustrated in Figure 4.8.

7 of the top 10 design concerns have 10 or more concerns per 1,000.

Discussion: is this an acceptable default rate? 


Top 10 design concerns 

concerns per 1,000 


Engine power is less than expected 
Tires make excessive noise while driving 
Engine makes abnormal/excessive noise 
Seat material concerns 
Excessive wind noise 
Hesitation or delay when shifting 
Bluetooth system has poor sound quality 
Steering system/wheel has too much play 
Bluetooth system is difficult to use 
Front seat audio/entertainment/navigation controls 




FIGURE 4.8 Leverage color to draw attention 
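
The selective use of color in Figure 4.8 can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to build the figure: the hex codes and the function name highlight_colors are assumptions, and the data are only the seven concerns whose values can be paired unambiguously with their labels in Figure 4.9.

```python
GREY = "#A6A6A6"  # assumed muted base color
BLUE = "#1F77B4"  # assumed attention-grabbing color


def highlight_colors(values, threshold, base=GREY, accent=BLUE):
    """Return one color per value: accent if it meets the threshold, else base."""
    return [accent if value >= threshold else base for value in values]


# Subset of the design concerns (concerns per 1,000).
concerns = {
    "Engine power is less than expected": 12.9,
    "Tires make excessive noise while driving": 12.3,
    "Hesitation or delay when shifting": 10.3,
    "Bluetooth system has poor sound quality": 10.0,
    "Steering system/wheel has too much play": 8.8,
    "Bluetooth system is difficult to use": 8.6,
    "Front seat audio/entertainment/navigation controls": 8.2,
}

colors = highlight_colors(concerns.values(), threshold=10.0)
for (label, value), color in zip(concerns.items(), colors):
    print(f"{color}  {value:>5.1f}  {label}")
```

The resulting list can be handed to whatever plotting tool you use as the per-bar color, so that everything below the threshold stays in the background.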


We can go one step further, using the same visual but with modified
focus and text to lead our audience from the macro to the micro
parts of the story, as demonstrated in Figure 4.9.


Of the top design concerns, three are noise-related.

Top 10 design concerns
concerns per 1,000

Engine power is less than expected  12.9
Tires make excessive noise while driving  12.3
Engine makes abnormal/excessive noise
Seat material concerns
Excessive wind noise
11.6
11.6
Hesitation or delay when shifting  10.3
Bluetooth system has poor sound quality  10.0
Steering system/wheel has too much play  8.8
Bluetooth system is difficult to use  8.6
Front seat audio/entertainment/navigation controls  8.2

Comments indicate that noisy tire issues are most apparent in the rain.
Complaints about engine noise commonly cited after the car had not been driven for a while.
Excessive wind noise is noted primarily in freeway driving at high speeds


FIGURE 4.9 Create a visual hierarchy of information 


Especially in live presentation settings, repeated iterations of the 
same visual, with different pieces emphasized to tell different stories 
or different aspects of the same story (as demonstrated in Figures 4.7,
4.8, and 4.9), can be an effective strategy. This allows you to familiarize
your audience with your data and visual first and then continue to
leverage it in the manner illustrated. Note in this example how your 
eyes are drawn to the elements of the visual you're meant to focus 
on due to strategic use of preattentive attributes. 


Highlighting one aspect can make other things 
harder to see 


One word of warning in using preattentive attributes:
when you highlight one point in your story, it can actually
make other points harder to see. When you're doing
exploratory analysis, you should mostly avoid the use of preattentive
attributes for this reason. When it comes to explanatory
analysis, however, you should have a specific story you
are communicating to your audience. Leverage preattentive
attributes to help make that story visually clear.


The previous example used mainly color to draw the viewer's attention.
Let's look at another scenario using a different preattentive
attribute. Recall the example introduced in Chapter 3: you manage 
an IT team and want to show how the volume of incoming tickets 
exceeds your team's resources. After decluttering the graph, we 
were left with Figure 4.10.

FIGURE 4.10 Let's revisit the ticket example

In the process of determining where I want to focus my audience's 
attention, one strategy I'll often employ is to start by pushing everything
to the background. This forces me to make explicit decisions
regarding what to bring to the forefront or highlight. Let's start by 
doing this; see Figure 4.11. 




FIGURE 4.11 First, push everything to the background 


Received 

Processed

Next, I want to make the data stand out. Figure 4.12 shows both
data series (Received and Processed) bolder and bigger than axis
lines and labels. It was an intentional decision to make the Processed
line darker than the Received line to draw emphasis to the fact that
the number of tickets being processed has fallen below the number
being received.



FIGURE 4.12 Make the data stand out 


Received 

Processed 


In this case, we want to draw our audience's attention to the right 
side of the graph, where the gap has started to form. Without other 
visual cues, our audience will typically start at the top left of our visual 
and do zigzagging "z's" with their eyes across the page. The viewer 
will eventually get to that gap on the right-hand side, but let's consider
how we can use our preattentive attributes to make that happen
more quickly.


The added marks of data points and numeric labels are one preattentive
attribute we can leverage. Bear with me, though, as we take
a step in the wrong direction before we go in the right one. See
Figure 4.13.

FIGURE 4.13 Too many data labels feels cluttered


When we add data markers and numeric labels to every data point, 
we quickly create a cluttered mess. But check out what happens in 
Figure 4.14 when we're strategic about which data markers and labels 
we preserve and which we eliminate. 



FIGURE 4.14 Data labels used sparingly help draw attention 


In Figure 4.14, the added marks act as a "look here" signal, drawing 
our audience's attention more quickly to the right side of the graph.
They provide for our audience the added benefit of allowing them to
do some quick math in the event that they want to understand how 
big the backlog is becoming (if we think that is something they'd 
definitely want to do, we should consider doing it for them). 
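
Doing that math for the audience is simple to automate. The sketch below computes the gap per period and the running backlog from two series; the function name backlog and the ticket counts are hypothetical, invented purely for illustration and not taken from the figures.

```python
from itertools import accumulate


def backlog(received, processed):
    """Return (gap_per_period, running_backlog) for paired ticket counts."""
    if len(received) != len(processed):
        raise ValueError("series must have the same length")
    gap_per_period = [r - p for r, p in zip(received, processed)]
    running = list(accumulate(gap_per_period))
    return gap_per_period, running


# Hypothetical monthly ticket counts, invented for illustration only.
received = [150, 160, 180, 170]
processed = [150, 158, 140, 120]

gap, running = backlog(received, processed)
print(gap)      # [0, 2, 40, 50]
print(running)  # [0, 2, 42, 92]
```

A single number from the running series, stated in the title or an annotation, may serve the audience better than leaving the subtraction to them.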

These are just a couple of examples of using preattentive attributes 
to focus the audience's attention. We will look at a number of additional
examples that leverage this same broad strategy in different
ways throughout the rest of this book. 

There are a few preattentive attributes that are so important from 
a strategic standpoint when it comes to focusing your audience's 
attention that they warrant their own specific discussions: size, color, 
and position on page. We'll address each of these in the following 
sections. 


Size 

Size matters. Relative size denotes relative importance. Keep this in 
mind when designing your visual communications. If you're showing
multiple things that are of roughly equal importance, size them
similarly. Alternatively, if there is one really important thing, leverage 
size to indicate that: make it BIG! 

The following is a real situation where size nearly caused unintended 
repercussions. 

Early in my career at Google, we were designing a dashboard to 
help with a decision-making process (I'm being intentionally vague 
to preserve confidentiality). In the design phase, there were three 
main pieces of information we knew we wanted to include, only 
one of which was readily available (the other data had to be chased 
after). In the initial versions of the dashboard, the information we 
had on hand took up probably 60% of the dashboard's real estate, 
with placeholders for the other information we were collecting. After
getting our hands on the other data, we plugged it into the existing
placeholders. Rather late in the game, we realized that the size of
that initial data we had included was drawing undue attention compared
to the rest of the information on the page. Luckily, we caught
this before it was too late. We modified the layout to make the three
equally important things the same size. It's interesting to think that
completely different conversations may have been had and decisions
reached as a result of this shift in design.

This was an important lesson for me (and one that we'll highlight in 
the next section on color as well): don't let your design choices be 
happenstance; rather, they should be the result of explicit decisions. 


Color 

When used sparingly, color is one of the most powerful tools you have 
for drawing your audience's attention. Resist the urge to use color 
for the sake of being colorful; instead, leverage color selectively as 
a strategic tool to highlight the important parts of your visual. The 
use of color should always be an intentional decision. Never let your 
tool make this important decision for you! 

I typically design my visuals in shades of grey and pick a single 
bold color to draw attention where I want it. My base color is grey, 
not black, to allow for greater contrast since color stands out more 
against grey than black. For my attention-grabbing color, I often 
use blue for a number of reasons: (1) I like it, (2) you avoid issues of 
colorblindness that we'll discuss momentarily, and (3) it prints well 
in black-and-white. That said, blue is certainly not your only option 
(and you'll see many examples where I deviate from my typical blue 
for various reasons). 

When it comes to the use of color, there are several specific lessons 
to know: use it sparingly, use it consistently, design with the colorblind
in mind, be thoughtful of the tone color conveys, and consider
whether to leverage brand colors. Let's discuss each of these in detail.


Use color sparingly

It's easy to spot a hawk in a sky full of pigeons, but as the variety of 
birds increases, that hawk becomes harder and harder to locate. 
Remember the adage from Colin Ware that we discussed in the last 
chapter on clutter? The same principle applies here. For color to be 
effective, it must be used sparingly. Too much variety prevents anything
from standing out. There needs to be sufficient contrast to
make something draw your audience's attention. 

When we use too many colors together, beyond entering rainbow- 
land, we lose their preattentive value. By way of example, I once 
encountered a table that showed market rank for a handful of pharmaceutical
drugs across a number of different countries, similar to
the left-hand side of Figure 4.15. Each rank (1, 2, 3, and so on) was
assigned its own color along a rainbow spectrum: 1 = red, 2 = orange,
3 = yellow, 4 = light green, 5 = green, 6 = teal, 7 = blue, 8 = dark blue,
9 = light purple, 10+ = purple. The cells within the table were filled
with the color that corresponded to the numerical ranking. Rainbow
Brite might have loved this table (for those unfamiliar, a quick Google
image search of Rainbow Brite will bring some understanding to this
statement), but I was not a fan. The power of the preattentive attributes
was lost: everything was different, which meant that nothing
stood out. We were back to the count the 3s example—only worse, 
because the variance in colors was actually more distracting than 
helpful. A better alternative would be to use varying color saturation
of a single color (a heatmap).

Country Level Sales Rank Top 5 Drugs
Rainbow distribution in color indicates sales rank in

Top 5 drugs: country-level sales rank
RANK  1  2  3  5+

COUNTRY          DRUG A
Australia        1
Brazil           1
Canada           2
China            1
France           3
Germany          3
India            4
Italy            2
Mexico           1
Russia           4
Spain            2
Turkey           7
United Kingdom   1
United States    1

FIGURE 4.15 Use color sparingly




Let's consider Figure 4.15. Where are your eyes drawn in the version 
on the left? Mine dart around quite a bit, trying to figure out what I 
should pay attention to. They hesitate on the dark purple, then red, 
then to the dark blue, probably because these have a higher saturation
of color than the others. However, when we consider what
these colors represent, it's not necessarily where we want our audience
to look.

In the version on the right-hand side, varying saturation of a single
color is used. Note that our perception is more limited when it
comes to relative saturation, but one benefit we get is that it does
carry with it some quantitative assumptions (that more heavily saturated
represents greater value than less or vice versa, something
you don't get with the rainbow colors used originally as categorical
differentiators). This works well for our purpose here, where the low
numbers (market leaders) are denoted with the highest color saturation.
We are drawn to the dark blue first: the market leaders. This
is a more thoughtful use of color.
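
One way to produce such a single-hue ramp is to hold the hue fixed and let saturation fall as rank rises. The sketch below uses Python's standard colorsys module; the function name rank_to_hex, the hue of 0.6 (a blue), and the choice to lump ranks of 5 and above together are assumptions made for illustration, not a description of how the figure was built.

```python
import colorsys


def rank_to_hex(rank, hue=0.6, max_rank=5):
    """Map rank 1 to the most saturated shade; ranks >= max_rank share the palest."""
    clamped = min(max(rank, 1), max_rank)
    saturation = 1.0 - (clamped - 1) / max_rank  # 1.0 for rank 1, 0.2 for max_rank
    red, green, blue = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return "#{:02X}{:02X}{:02X}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


# A few Drug A ranks from the table in Figure 4.15.
drug_a = {"Australia": 1, "Canada": 2, "France": 3, "India": 4, "Turkey": 7}
for country, rank in drug_a.items():
    print(f"{country:<10} rank {rank}  {rank_to_hex(rank)}")
```

Because only saturation varies, the ordering of the shades mirrors the ordering of the ranks, which a rainbow assignment cannot do.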
Where are your eyes drawn?


There is an easy test for determining whether preattentive
attributes are being used effectively. Create your visual,
then close your eyes or look away for a moment and then
look back at it, taking note of where your eyes are drawn first.
Do they immediately land where you want your audience to
focus? Better yet, seek the help of a friend or colleague: ask
them to talk you through how they process the visual: where
their eyes go first, where they go next, and so on. This is a
great way to see things through your audience's eyes and
confirm whether the visual you've created is drawing attention
and creating a visual hierarchy of information in the way
that you desire.


Use color consistently 

One question regularly raised in my workshops is around novelty. 
Does it make sense to change up the colors or graph types so the 
audience doesn't get bored? My answer is a resounding No! The 
story you are telling should be what keeps your audience's attention 
(we'll talk about story more in Chapter 7), not the design elements of
your graphs. When it comes to the type of graph, you should always
use whatever will be easiest for your audience to read. When showing
similar information that can be graphed the same way, there can
be benefit to keeping the same layout as you essentially train your 
audience how to read the information, making the interpretation of 
later graphs all the easier and reducing mental fatigue. 

A change in colors signals just that—a change. So leverage this 
when you want your audience to feel change for some reason, 
but never simply for the sake of novelty. If you are designing your 
communication in shades of grey and using a single color to draw 
attention, leverage that same schematic throughout the communication.
Your audience quickly learns that blue, for example, signals
where they are meant to look first, and can use this understanding
as they process subsequent slides or visuals. However, if you want
to signal a clear change in topic or tone, a shift in color is one way 
to visually reinforce this. 

There are some cases where use of color must be consistent. Your 
audience will typically take time to familiarize themselves with what 
colors mean once and then will assume the same details apply 
throughout the rest of the communication. For example, if you are 
displaying data across four regions in a graph, each having their 
own color in one place within your presentation or report, be sure 
to preserve this same schematic throughout the visuals in the rest 
of your presentation or report (and avoid use of the same colors for 
other purposes if possible). Don't confuse your audience by changing
your use of color.
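
In practice, this kind of consistency is easiest to keep when the color assignments live in one place and every visual looks them up. A minimal sketch, in which the region names, hex codes, and the helper colors_for are all hypothetical:

```python
# Hypothetical region names and hex codes; define them once and reuse everywhere.
REGION_COLORS = {
    "North": "#1F77B4",
    "South": "#FF7F0E",
    "East": "#7F7F7F",
    "West": "#17BECF",
}


def colors_for(regions):
    """Look up the fixed color for each region; fail loudly on unknown names."""
    missing = [region for region in regions if region not in REGION_COLORS]
    if missing:
        raise KeyError(f"no color assigned for: {missing}")
    return [REGION_COLORS[region] for region in regions]


# Two visuals that list regions in different orders still agree on color.
overview = colors_for(["North", "South", "East", "West"])
detail = colors_for(["West", "North"])
print(overview)
print(detail)
```

Raising an error for an unknown region, rather than falling back to a default color, keeps a new category from silently borrowing a color that already means something else.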


Design with colorblind in mind 

Roughly 8% of men (including my husband and a former boss) and 
half a percent of women are colorblind. This most frequently manifests
itself as difficulty in distinguishing between shades of red and
shades of green. In general, you should avoid using shades of red 
and shades of green together. Sometimes, though, there is useful 
connotation that comes with using red and green: red to denote the 
double-digit loss you want to draw attention to or green to highlight 
significant growth. You can still leverage this, but make sure to have 
some additional visual cue to set the important numbers apart so you 
aren't inadvertently disenfranchising part of your audience. Consider 
also using bold, varying saturation or brightness, or adding a simple 
plus or minus sign in front of the numbers to ensure they stand out. 
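
The plus-or-minus cue is easy to apply wherever the labels are generated. The helper below is a small sketch (the name format_change and the percent formatting are assumptions): it puts an explicit sign in front of every nonzero change so the direction is readable without relying on red or green.

```python
def format_change(value, decimals=1):
    """Format a percent change with an explicit sign so it does not rely on color."""
    if round(value, decimals) == 0:
        return f"{0:.{decimals}f}%"  # no sign on a change that rounds to zero
    return f"{value:+.{decimals}f}%"


for change in (12, -10.5, 0):
    print(format_change(change))
```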

When I'm designing a visual and selecting colors to highlight both 
positive and negative aspects, I frequently use blue to signal positive 
and orange for negative. I feel that positive and negative associations 
with these colors are still recognizable and you avoid the colorblind 
challenge described above. When you face this situation, consider 
whether you need to highlight both ends of the scale (positive and
negative) with color, or if drawing attention to one or the other (or
sequentially, one and then the other) might work to tell your story. 


See your graphs and slides through colorblind eyes 


There are a number of sites and applications with
colorblindness simulators that allow you to see what 
your visual looks like through colorblind eyes. For example, 
Vischeck (vischeck.com) allows you to upload images 
or download the tool to use on your own computer. 

Color Oracle (colororacle.org) offers a free download for 
Windows, Linux, or Mac that applies a full-screen color 
filter independent of the software in use. CheckMyColours 
(checkmycolours.com) is a tool for checking foreground 
and background colors and determining if they provide 
sufficient contrast when viewed by someone having 
color-sight deficiency. 
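
For a rough sense of what a foreground/background contrast check involves, the sketch below implements the contrast ratio defined in the WCAG 2.x accessibility guidelines. Whether any of the tools named above compute exactly this is an assumption; note also that the ratio measures luminance contrast only and does not simulate color-vision deficiency.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 0-255 channels."""
    r, g, b = (_linear(channel) for channel in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0
```

WCAG's AA level asks for a ratio of at least 4.5:1 for normal-size text, which gives a concrete number to check a light brand color against before using it for labels.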


Be thoughtful of tone that color conveys 

Color evokes emotion. Consider the tone you want to set with your 
data visualization or broader communication and choose a color 
(or colors) that help reinforce the emotion you want to arouse from 
your audience. Is the topic serious or lighthearted? Are you mak¬ 
ing a striking bold statement and want your colors to echo it, or is a 
more circumspect approach with a muted color-scheme appropriate? 

Let's discuss a couple specific examples of color and tone. I was once 
told by a client that the visuals I had made over looked "too nice" 
(as in friendly). I had created these particular visuals in my typical 
color palette: shades of grey with a medium blue used sparingly to 
draw attention. They were reporting the results of statistical analysis, 
and were used to and wanted a more clinical look and feel. Taking 
this into account, I reworked the visuals to leverage bold black to 
draw attention. I also swapped some of the title text for all capital
letters and changed the font throughout (we'll discuss font in more
detail in Chapter 5 in the context of design). 

The resulting visuals, though at the core were exactly the same, had a 
completely different look and feel because of these simple changes. 
As with many of the other decisions we make when communicating
with data, the audience (in this case, my client) should be kept
top of mind and their needs and desires considered when making 
design choices like these. 


Cultural color connotations 


When picking colors for communications to international
audiences, it may be important to consider the connotations
colors have in other cultures. David McCandless
created a visualization showing colors and what they mean 
in different cultures, which can be found in his book The 
Visual Miscellaneum: A Colorful Guide to the World's 
Most Consequential Trivia (2012) or on his website at 
informationisbeautiful.net/visualizations/colours-in-cultures. 


As another example on color and tone, I recall flipping through an airline
magazine on a business trip and finding a fluffy article on online
dating accompanied by graphs charting related data. The graphs 
were almost entirely hot pink and teal. Would you choose this color 
scheme for your quarterly business report? Certainly not. But given 
the nature and lively tone of the article these visuals accompanied, 
the peppy colors worked (and caught my attention!). 


Brand colors: to leverage or not to leverage? 

Some companies go through major undertakings to create their 
branding and associated color palette. There may be brand colors 
that you are required to work with or that make sense to leverage. 
The key to success when that is the case is to identify one or maybe
two brand-appropriate colors to use as your "audience-look-here"
cues and keep the rest of your color palette relatively muted with 
shades of grey or black. 

In some cases, it may make sense to deviate from brand colors 
entirely. For example, I was once working with a client whose brand 
color was a light shade of green. I originally wanted to leverage this 
green as the standout color, but it simply wasn't attention grabbing 
enough. There wasn't sufficient contrast, so the visuals I created had 
a washed-out feel. When this is the case, you can use bold black to 
draw attention when everything else is in shades of grey, or choose 
an entirely different color—just make sure it doesn't clash with the 
brand colors if they need to be shown together (for example, if the 
brand logo will be on each page of the slide deck you are building). 
In this particular case, the client favored the version where I used 
an entirely different color. A sample of each of the approaches is 
shown in Figure 4.16. 


Leverage brand color 


Category 1 
Category 2 
Category 3 
Category 4 
Category 5 



Draw attention with black 


Category 1 
Category 2 
Category 3 
Category 4 
Category 5 



Use complementary color 

Category 1 
Category 2 
Category 3 
Category 4 
Category 5 



ClientLogo ClientLogo ClientLogo

FIGURE 4.16 Color options with brand color 


In short: be thoughtful when it comes to your use of color! 


Position on page 

Without other visual cues, most members of your audience will start 
at the top left of your visual or slide and scan with their eyes in zigzag
motions across the screen or page. They see the top of the page
first, which makes this precious real estate. Think about putting the
most important thing here (see Figure 4.17). 



FIGURE 4.17 The zigzag "z" of taking in information on a screen or page 

If something is important, try not to make your audience wade 
through other stuff to get to it. Eliminate this work by putting the 
important thing at the top. On a slide, these may be words (the main 
takeaway or call to action). In a data visualization, think about which 
data you want your audience to see first and whether rearranging the 
visual accordingly makes sense (it won't always, but this is one tool 
you have at your disposal for signaling importance to your audience). 

Aim to work within the way your audience takes in information, not 
against it. Here is an example of asking the audience to work against 
the way that comes naturally to them: I was once shown a process 
flow diagram that started at the bottom right and you were meant to 
read it upwards and to the left. This felt really uncomfortable 
(feelings of discomfort are something we should aim to avoid in our 
audience!). All I wanted to do was read it from the top left to the 
bottom right, irrespective of the other visual cues that were present 
to try to encourage me to do the opposite. Another example I sometimes 
see in data visualization is something plotted on a scale ranging from 
negative to positive where the positive values are on the left (which 
is more typically associated with negative) and the negative values 
are on the right (which is more naturally associated with positive). 
Again, in this example, the information is organized in a way that is 
counter to the way the audience wants to take in the information, 
rendering the visual challenging to decipher. We'll look at a specific 
example related to this in case study 3 in Chapter 9. 

Be mindful of how you position elements on a page and aim to do 
so in a way that will feel natural for your audience to consume. 


In closing 

Preattentive attributes are powerful tools when used sparingly and 
strategically in visual communication. Without other cues, our 
audience is left to process all of the information we put in front of 
them. Ease this by leveraging preattentive attributes like size, 
color, and position on page to signal what's important. Use these 
strategic attributes to draw attention to where you want your audience 
to look and create visual hierarchy that helps guide your audience 
through the visual in the way you want. Evaluate the effectiveness of 
preattentive attributes in your visual by applying the "where are your 
eyes drawn?" test. 

With that, consider your fourth lesson learned. You now know how to 
focus your audience's attention where you want them to pay it. 

Form follows function. This adage of product design has clear 
application to communicating with data. When it comes to the form and 
function of our data visualizations, we first want to think about what 
it is we want our audience to be able to do with the data (function) 
and then create a visualization (form) that will allow for this with 
ease. In this chapter, we will discuss how traditional design concepts 
can be applied to communicating with data. We will explore 
affordances, accessibility, and aesthetics, drawing on a number of 
concepts introduced previously, but looking at them through a slightly 
different lens. We will also discuss strategies for gaining audience 
acceptance of your visual designs. 

Designers know the fundamentals of good design but also how to 
trust their eye. You may think to yourself, But I'm not a designer! 
Stop thinking this way. You can recognize smart design. By becoming 
familiar with some common aspects and examples of great design, 
we will instill confidence in your visual instincts and learn some 
concrete tips to follow and adjustments to make when things don't feel 
quite right. 

Affordances 

In the field of design, experts speak of objects having "affordances." 
These are aspects inherent to the design that make it obvious how 
the product is to be used. For example, a knob affords turning, a 
button affords pushing, and a cord affords pulling. These 
characteristics suggest how the object is to be interacted with or operated. 
When sufficient affordances are present, good design fades into the 
background and you don't even notice it. 

For an example of affordances in action, let's look to the OXO brand. 
On their website, they articulate their distinguishing feature as 
"Universal Design," a philosophy of making products that are easy to 
use for the widest possible spectrum of users. Of particular relevance 
to our conversation here are their kitchen gadgets (which were once 
marketed as "tools you hold on to"). The gadgets are designed in 
such a way that there is really only one way to pick them up: the 
correct way. In this way, OXO kitchen gadgets afford correct use, 
without most users recognizing that this is due to thoughtful design 
(Figure 5.1). 



FIGURE 5.1 OXO kitchen gadgets 

Let's consider how we can translate the concept of affordances to 
communicating with data. We can leverage visual affordances to 
indicate to our audience how to use and interact with our 
visualizations. We'll discuss three specific lessons to this end: 
(1) highlight the important stuff, (2) eliminate distractions, and 
(3) create a clear hierarchy of information. 


Highlight the important stuff 

We've previously demonstrated the use of preattentive attributes to 
draw our audience's attention to where we want them to focus: in 
other words, to highlight the important stuff. Let's continue to explore 
this strategy. Critical here is to highlight only a fraction of the 
overall visual, since highlighting effects are diluted as the share of 
the visual that is highlighted increases. In Universal Principles of 
Design (Lidwell, Holden, and Butler, 2003), it is recommended that at 
most 10% of the visual design be highlighted. They offer the following 
guidelines: 


• Bold, italics, and underlining: Use for titles, labels, captions, and 
short word sequences to differentiate elements. Bolding is generally 
preferred over italics and underlining because it adds minimal 
noise to the design while clearly highlighting chosen elements. 
Italics add minimal noise, but also don't stand out as much and 
are less legible. Underlining adds noise and compromises legibility, 
so should be used sparingly (if at all). 

• CASE and typeface: Uppercase text in short word sequences is 
easily scanned, which can work well when applied to titles, labels, 
and keywords. Avoid using different fonts as a highlighting 
technique, as it's difficult to attain a noticeable difference without 
disrupting aesthetics. 

• Color is an effective highlighting technique when used sparingly 
and generally in concert with other highlighting techniques (for 
example, bold). 

• Inversing elements is effective at attracting attention, but can add 
considerable noise to a design so should be used sparingly. 

• Size is another way to attract attention and signal importance. 

I've omitted "blinking or flashing" from the list above, which Lidwell 
et al. include with instructions to use only to indicate highly critical 
information that requires immediate response. I do not recommend 
using blinking or flashing when communicating with data for 
explanatory purposes (it tends to be more annoying than helpful). 

Note that preattentive attributes can be layered, so if you have 
something really important, you can signal this and draw attention by 
making it large, colored, and bold. 

Let's look at a specific example using highlighting effectively in 
data visualization. A graph similar to Figure 5.2 was included in a 
February 2014 Pew Research Center article titled "New Census 
Data Show More Americans Are Tying the Knot, but Mostly It's the 
College-Educated." 


(Figure 5.2 is a bar graph titled "New Marriage Rate by Education," 
subtitled "Number of newly married adults per 1,000 marriage eligible 
adults," with a group of bars for each of: All, Less than high school, 
High school graduate, Some college, and Bachelor's degree or more. 
Note: Marriage eligible includes the newly married plus those widowed, 
divorced, or never married at interview. Source: U.S. Census. Adapted 
from Pew Research Center.) 

FIGURE 5.2 Pew Research Center original graph 

Based on the article that accompanied it, Figure 5.2 is meant to 
demonstrate that the 2011 to 2012 increase observed in total new 
marriages was driven primarily by an increase in those having a 
bachelor's degree or more (there doesn't actually appear to be an 
increase based on the "All" trend shown, but let's ignore this). The 
design of Figure 5.2 does little to draw this clearly to our attention, 
however. Rather, my attention is drawn to the 2012 bars within the 
various groups because they are rendered in a darker color than the rest. 

Changing the use of color in this visual can completely redirect our 
focus. See Figure 5.3. 


(Figure 5.3 repeats the bar graph from Figure 5.2, with the same title, 
subtitle, categories, note, and source, but with the color changed.) 

FIGURE 5.3 Highlight the important stuff 


In Figure 5.3, the color orange has been used to highlight the data 
points for those having a bachelor's degree or more. By making 
everything else grey, the highlighting provides a clear signal of where 
we should focus our attention. We'll come back to this example 
momentarily. 
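The idea behind Figure 5.3 (one accent color, everything else grey) can be sketched in a few lines. This is a minimal illustration, not the code behind the figure: the hex values, function names, and the optional matplotlib helper are all assumptions, and the share-of-marks calculation is only a rough proxy for the 10% guideline, which refers to the visual design as a whole.

```python
ACCENT = "#F28E2B"  # an orange; the exact hex value is an assumption
MUTED = "#BFBFBF"   # light grey for everything that is not the focus


def highlight_colors(categories, focus, accent=ACCENT, muted=MUTED):
    """One color per category: the accent for the focus, grey for the rest."""
    return [accent if name == focus else muted for name in categories]


def highlighted_share(colors, muted=MUTED):
    """Fraction of marks that are not muted (a rough proxy for 'how much is highlighted')."""
    if not colors:
        return 0.0
    return sum(color != muted for color in colors) / len(colors)


def draw_bars(categories, values, focus):
    """Optional: draw the bars with matplotlib, if it is installed."""
    import matplotlib.pyplot as plt  # imported lazily so the helpers above need no dependencies

    fig, ax = plt.subplots()
    ax.bar(categories, values, color=highlight_colors(categories, focus))
    return fig


categories = [
    "All",
    "Less than high school",
    "High school graduate",
    "Some college",
    "Bachelor's degree or more",
]
colors = highlight_colors(categories, focus="Bachelor's degree or more")
```

Here only the last category receives the accent color; passing your own values to `draw_bars` would render the result.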


Eliminate distractions 

While we highlight the important pieces, we also want to eliminate 
distractions. In his book Airman's Odyssey, Antoine de Saint-Exupery 
famously said, "You know you've achieved perfection, not when you 
have nothing more to add, but when you have nothing to take away" 
(Saint-Exupery, 1943). When it comes to the perfection of design with 
data visualization, the decision of what to cut or de-emphasize can 
be even more important than what to include or highlight. 

To identify distractions, think about both clutter and context. We've 
discussed clutter previously: these are elements that take up space 
but don't add information to our visuals. Context is what needs to 
be present for your audience in order for what you want to 
communicate to make sense. When it comes to context, use the right 
amount: not too much, not too little. Consider broadly what 
information is critical and what is not. Identify unnecessary, 
extraneous, or irrelevant items or information. Determine whether 
there are things that might be distracting from your main message or 
point. All of these are candidates for elimination. 

Here are some specific considerations to help you identify potential 
distractions: 

• Not all data are equally important. Use your space and audience's 
attention wisely by getting rid of noncritical data or components. 

• When detail isn't needed, summarize. You should be familiar 
with the detail, but that doesn't mean your audience needs to be. 
Consider whether summarizing is appropriate. 

• Ask yourself: would eliminating this change anything? No? Take 
it out! Resist the temptation to keep things because they are cute 
or because you worked hard to create them; if they don't support 
the message, they don't serve the purpose of communication. 

• Push necessary, but non-message-impacting items to the 
background. Use your knowledge of preattentive attributes to 
de-emphasize. Light grey works well for this. 

Each step in reduction and de-emphasis causes what remains to 
stand out more. In cases where you are unsure whether you'll need 
the detail that you're considering cutting, think about whether there 
is a way to include it without diluting your main message. For 
example, in a slide presentation, you can push content to the appendix 
so it's there if you need it but won't distract from your main point. 

Let's look back at the Pew Research example discussed previously. 
In Figure 5.3, we used color sparingly to highlight the important part 
of our visual. We can further improve this graph by eliminating 
distractions, as illustrated in Figure 5.4. 


(Figure 5.4 is a line graph titled "New marriage rate by education," 
subtitled "Number of newly married adults per 1,000 marriage eligible 
adults," with the years 2008 through 2012 on the x-axis and one line 
each for Bachelor's degree or more, Some college, High school grad, 
and Less than high school. Note: Marriage eligible includes the newly 
married plus those widowed, divorced, or never married at interview. 
Source: U.S. Census. Adapted from Pew Research Center.) 

FIGURE 5.4 Eliminate distractions 

In Figure 5.4, a number of changes were made to eliminate 
distractions. The biggest shift was from a bar graph to a line graph. As 
we've discussed, line graphs typically make it easier to see trends 
over time. This shift also has the effect of visually reducing discrete 
elements, because the data that was previously five bars has been 
reduced to a single line with the end points highlighted. When we 
consider the full data being plotted, we've gone from 25 bars to 4 
lines. The organization of the data as a line graph allows the use of a 
single x-axis that can be leveraged across all of the categories. This 
simplifies the processing of the information (rather than seeing the 
years in a legend at the left and then having to translate across the 
various groups of bars). 

The "All" category included in the original graph was removed 
altogether. This was the aggregate of all of the other categories, so 
showing it separately was redundant without adding value. This 
won't always be the case, but here it didn't add anything 
interesting to the story. 

The decimal points in the data labels were eliminated by rounding to 
the nearest whole number. The data being plotted is "Number of newly 
married adults per 1,000," and I find it strange to discuss the number 
of adults using decimal places (fractions of a person!). Additionally, 
the sheer size of the numbers and visible differences between them 
mean that we don't need the level of precision or granularity that 
decimal points provide. It is important to take context into account 
when making decisions like this. 
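If you build data labels in code, the rounding step deserves a little care. The sketch below is one possible helper (the function name is hypothetical): it rounds halves up, so the 61.5 label from the original graph becomes 62. Python's built-in `round` instead rounds halves to the nearest even number, which can surprise readers who compare labels against the source data.

```python
from decimal import Decimal, ROUND_HALF_UP


def whole_number_label(value):
    """Format a value as a whole-number data label, rounding halves up."""
    rounded = Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return str(rounded)


label = whole_number_label(61.5)  # the 61.5 label shown in the original graph
```

Whether whole numbers are precise enough still depends on context, as noted above; the helper only makes the rounding rule explicit.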

The italics in the subtitle were changed to regular font. There was 
no reason to draw attention to these words. In the original, I found 
that the spatial separation between the title and subtitle also caused 
undue attention to be placed on the subtitle, so I removed the 
spacing in the makeover. 

Finally, the highlighting of the "Bachelor's degree or more" category 
introduced in Figure 5.3 was preserved and extended to include 
the category name in addition to the data labels. As we've seen 
previously, this is a way to tie components together visually for our 
audience, easing the interpretation. 

Figure 5.5 shows the before-and-after. 


(Figure 5.5 places the original bar graph, "New Marriage Rate by 
Education," side by side with the remade line graph, "New marriage 
rate by education." Both carry the same subtitle, note, and source as 
in Figures 5.2 and 5.4.) 

FIGURE 5.5 Before-and-after 


By highlighting the important stuff and eliminating distractions, we've 
markedly improved this visual. 


Create a clear visual hierarchy of information 

As we discussed in Chapter 4, the same preattentive attributes we 
use to highlight the important stuff can be leveraged to create a 
hierarchy of information. We can visually pull some items to the 
forefront and push other elements to the background, indicating to our 
audience the general order in which they should process the 
information we are communicating. 

The power of super-categories 


In tables and graphs, it can sometimes be useful to leverage 
super-categories to organize the data and help provide 
a construct for your audience to use to interpret it. For example, 
if you're looking at a table or graph that shows a value 
for 20 different demographic breakdowns, it can be useful to 
organize and clearly label the demographic breakdowns into 
groups or super-categories like age, race, income level, and 
education. These super-categories provide a hierarchical 
organization that simplifies the process of taking in the information. 
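As a rough illustration of the sidebar above, here is one way to organize flat breakdown rows under super-categories before laying out a table. The breakdown labels and values are made up purely for illustration, and the function names are hypothetical.

```python
def group_by_super_category(rows, super_category_of):
    """Group (label, value) rows under their super-category, keeping input order."""
    grouped = {}
    for label, value in rows:
        grouped.setdefault(super_category_of[label], []).append((label, value))
    return grouped


def as_outline(grouped):
    """Render the groups as indented text: the super-category, then its rows."""
    lines = []
    for super_category, items in grouped.items():
        lines.append(super_category)
        lines.extend(f"  {label}: {value}" for label, value in items)
    return "\n".join(lines)


# Hypothetical breakdowns and made-up values, for illustration only.
SUPER_CATEGORY_OF = {
    "18-34": "Age",
    "35-54": "Age",
    "55+": "Age",
    "High school": "Education",
    "Bachelor's": "Education",
}
rows = [("18-34", 12), ("High school", 9), ("35-54", 15), ("Bachelor's", 18), ("55+", 7)]
grouped = group_by_super_category(rows, SUPER_CATEGORY_OF)
outline = as_outline(grouped)
```

Printing `outline` lists each super-category as a heading with its breakdowns indented beneath it, which is the hierarchical organization the sidebar describes.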


Let's look at an example where a clear visual hierarchy has been 
established and discuss the specific design choices that were made to 
create it. Imagine you are a car manufacturer. Two important dimensions 
by which you judge the success of a particular make and model are 
(1) customer satisfaction and (2) frequency of car issues. A 
scatterplot could be useful to visualize how the current year's models 
compare with the previous year's average along these two dimensions, 
as shown in Figure 5.6. 

(Figure 5.6 is a scatterplot titled "Issues vs. Satisfaction by Model." 
The x-axis is Satisfaction, from LOW to HIGH, measured as % satisfied 
or highly satisfied; the y-axis is Things Gone Wrong.) 

FIGURE 5.6 Clear visual hierarchy of information 

Figure 5.6 lets us quickly see how this year's various models compare 
to last year's average on the basis of both satisfaction and issues. The 
size and color of font and data points alert us where to pay 
attention and in what general order. Let's consider the visual hierarchy 
of components and how they help us process the information 
presented. If I articulate the order in which I take in the information, 
it looks something like the following: 

First, I read the graph title: "Issues vs. Satisfaction by Model." 
The bolding of Issues and Satisfaction signals that those words 
are important, so I have that context in mind as I process the rest 
of the visual. 

Next, I see the y-axis primary label: "Things Gone Wrong." I note 
that these fall along a scale from few (at the top) to many (at the 
bottom). After that, I note the details across the horizontal x-axis: 
Satisfaction, ranging from low (left) to high (right). 

I am then drawn to the dark grey point and corresponding words 
"Prior Year Average." The lines drawing this point to the axes allow 
me quickly to see that the prior year's average was around 900 
issues per 1,000 and 72% satisfied or highly satisfied. This provides 
a useful construct for interpreting this year's models. 

Finally, I am drawn to all of the red in the bottom right quadrant. 
The words tell me satisfaction is high, but there are many issues. 
It's clear because of how the visual is constructed that these are 
cases where the level of issues is greater than it was for last year's 
average. The red color reinforces that this is a problem. 

We previously discussed super-categories for easing interpretation. 
Here, the quadrant labels "High Satisfaction, Few Issues" and "High 
Satisfaction, Many Issues" function in this manner. In absence of 
these, I could spend time processing the axis titles and labels and 
eventually figure out that's what these quadrants represent, but it's 
a much easier process when the pithy titles are present, eliminating 
the need for this processing altogether. Note that the left quadrants 
aren't labeled; labels are unnecessary since no values fall there. 

Additional data points and details are there for context, but they 
are pushed to the background to reduce the cognitive burden and 
simplify the visual. 
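The quadrant logic described above can be sketched as a small classifier. This is an illustrative assumption about how such a visual might be prepared, not how Figure 5.6 was built: the baseline uses the approximate prior-year values quoted in the text (about 900 issues per 1,000 and 72% satisfied), while the color names, the "Low Satisfaction" labels, and the sample points are hypothetical.

```python
# Approximate prior-year average quoted in the text above.
PRIOR_YEAR = {"satisfaction": 72.0, "issues": 900.0}


def quadrant(satisfaction, issues, baseline=PRIOR_YEAR):
    """Label a model relative to the prior-year average on both dimensions."""
    sat = "High Satisfaction" if satisfaction >= baseline["satisfaction"] else "Low Satisfaction"
    iss = "Many Issues" if issues > baseline["issues"] else "Few Issues"
    return f"{sat}, {iss}"


def point_style(satisfaction, issues):
    """Pull problem points to the foreground; push everything else back."""
    if quadrant(satisfaction, issues) == "High Satisfaction, Many Issues":
        return {"color": "red", "layer": "foreground"}
    return {"color": "lightgrey", "layer": "background"}


# Hypothetical models as (satisfaction %, issues per 1,000); not real data.
models = {"Model A": (80.0, 1100.0), "Model B": (85.0, 600.0)}
styles = {name: point_style(*values) for name, values in models.items()}
```

Only the hypothetical Model A lands in the "High Satisfaction, Many Issues" quadrant and is drawn in red; Model B stays in the grey background layer.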

Upon sharing this visual with my husband, his reaction was "that's 
not the order I paid attention—I went straight to the red." That got 
me to thinking. First, I was surprised he started there, given that he's 
red-green colorblind, but he said that the red was different enough 
from everything else in the visual that it still grabbed his attention. 
Second, I look at so many graphs that it's ingrained in me to start 
with the details: the titles and axis titles to understand what I'm 
looking at before I get to the data. Others may look more quickly for the 
"so what." If we approach that way, we're drawn first to the bottom 
right quadrant since the red signals importance and that attention 
should be paid. After taking that in, perhaps we back up and read 
some of the other detail of the graph. 

In either case, the thoughtful and clear visual hierarchy establishes 
order for the audience to use to process the information in a 
complex visual without it feeling, well, complicated. For our audience, 
by highlighting the important stuff, eliminating distractions, and 
establishing a visual hierarchy, the data visualizations we create 
afford understanding. 

Accessibility 

The concept of accessibility says that designs should be usable by 
people of diverse abilities. Originally, this consideration was for those 
with disabilities, but over time the concept has grown more general, 
which is the way in which I'll discuss it here. Applied to data 
visualization, I think of it as design that is usable by people of 
widely varying technical skills. You might be an engineer, but it 
shouldn't take someone with an engineering degree to understand your 
graph. As the designer, the onus is on you to make your graph accessible. 

Poor design: who is at fault? 


Well-designed data visualization, like a well-designed 
object, is easy to interpret and understand. When 
people have trouble understanding something, such as 
interpreting a graph, they tend to blame themselves. In most 
cases, however, this lack of understanding is not the user's 
fault; rather, it points to fault in the design. Good design 
takes planning and thought. Above all else, good design 
takes into account the needs of the user. This is another 
reminder to keep your user—your audience—top-of-mind 
when designing your communications with data. 


For an example of accessibility in design, let's consider the iconic 
London underground tube map. Harry Beck produced a beautifully 
simple design in 1933, recognizing that the above-ground geography 
is unimportant when navigating the lines and removing the 
constraints it imposed. Compared to previous tube maps, Beck's 
accessible design rendered an easy-to-follow visual that became an 
essential guide to London and a template for transport maps around 
the world. It is that same map, with some minor modifications, that 
still serves London today. 

We'll discuss two specific strategies related to accessibility in 
communicating with data: (1) don't overcomplicate and (2) text is your friend. 


Don't overcomplicate 

"If it's hard to read, it's hard to do." This was the finding of research 
undertaken by Song and Schwarz at the University of Michigan in 
2008. First, they presented two groups of students with instructions 
for an exercise regimen. Half the students received the instructions 
written in easy-to-read Arial font; the other half were given 
instructions in a cursive-like font called Brushstroke. Students were 
asked how long the exercise routine would take and how likely they were 
to try it. The finding: the fussier the font, the more difficult the 
students judged the routine and the less likely they were to undertake 
it. A second study using a sushi recipe had similar findings. 

Translation for data visualization: the more complicated it looks, the 
more time your audience perceives it will take to understand and the 
less likely they are to spend time to understand it. 

As we've discussed, visual affordances can help in this area. Here 
are some additional tips to keep your visuals and communications 
from appearing overly complicated: 

• Make it legible: use a consistent, easy-to-read font (consider both 
typeface and size). 

• Keep it clean: make your data visualization approachable by 
leveraging visual affordances. 

• Use straightforward language: choose simple language over 
complex, choose fewer words over more words, define any 
specialized language with which your audience may not be familiar, 
and spell out acronyms (at minimum, the first time you use them 
or in a footnote). 

• Remove unnecessary complexity: when making a choice between 
simple and complicated, favor simple. 

This is not about oversimplifying, but rather not making things more 
complicated than they need to be. I once sat through a presentation 
given by a well-respected PhD. The guy was obviously smart. When 
he said his first five-syllable word, I found myself impressed with his 
vocabulary. But as his academic language continued, I started to 
lose patience. His explanations were unnecessarily complicated. His 
words were unnecessarily long. It took a lot of energy to pay attention. 
I found it hard to listen to what he was saying as my annoyance grew. 

Beyond annoying our audience by trying to sound smart, we run the 
risk of making our audience feel dumb. In either case, this is not a 
good user experience for our audience. Avoid this. If you find it hard 
to determine whether you are overcomplicating things, seek input 
or feedback from a friend or colleague. 


Text is your friend 

Thoughtful use of text helps ensure that your data visualization is 
accessible. Text plays a number of roles in communicating with data: 
use it to label, introduce, explain, reinforce, highlight, recommend, 
and tell a story. 

There are a few types of text that absolutely must be present. Assume 
that every chart needs a title and every axis needs a title (exceptions 
to this rule will be extremely rare). The absence of these titles, no 
matter how clear you think it may be from context, causes your 
audience to stop and question what they are looking at. Instead, label 
explicitly so they can use their brainpower to understand the 
information, rather than spend it trying to figure out how to read the 
visual. 
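If your charts are described in code before they are drawn, the "title on the chart, title on every axis" rule can be checked mechanically. The sketch below assumes a chart is represented as a plain dictionary with hypothetical keys (`title`, `x_axis_title`, `y_axis_title`); it is not tied to any particular plotting library.

```python
REQUIRED_TEXT = ("title", "x_axis_title", "y_axis_title")


def missing_text(chart):
    """Return the required text fields that are absent or blank in a chart spec."""
    return [
        field
        for field in REQUIRED_TEXT
        if not str(chart.get(field) or "").strip()
    ]


# Hypothetical chart spec with the y-axis title left blank on purpose.
draft = {"title": "Ticket volume over time", "x_axis_title": "2014", "y_axis_title": ""}
problems = missing_text(draft)
```

For this draft the check reports the blank y-axis title. A deliberate exception can simply skip the check; the point is to make omissions a conscious choice rather than an oversight.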

Don't assume that two different people looking at the same data 
visualization will draw the same conclusion. If there is a conclusion 
you want your audience to reach, state it in words. Leverage 
preattentive attributes to make those important words stand out. 


Action titles on slides 


The title bar at the top of your PowerPoint slide is precious 
real estate: use it wisely! This is the first thing your 
audience encounters on the page or screen and yet so often it 
gets used for redundant descriptive titles (for example, 
"2015 Budget"). Instead use this space for an action title. If 
you have a recommendation or something you want your 
audience to know or do, put it here (for example, "Estimated 
2015 spending is above budget"). It means your audience 
won't miss it and also works to set expectations for what will 
follow on the rest of the page or screen. 

When it comes to words in data visualization, it can sometimes be 
useful to annotate important or interesting points directly on a graph. 
You can use annotation to explain nuances in the data, highlight 
something to pay attention to, or describe relevant external factors. 
One of my favorite examples of annotation in data visualization is 
Figure 5.7 by David McCandless, "Peak Break-up Times According 
to Facebook Status Updates." 


(Figure 5.7 is an annotated graph titled "Peak Break-up Times," 
subtitled "According to Facebook status updates," with the months JAN 
through DEC along the x-axis.) 

FIGURE 5.7 Words used wisely 

As we follow the annotations from left to right in Figure 5.7, we see 
a small increase on Valentine's Day, then large peaks in the weeks of 
Spring Break (cleverly subtitled "Spring clean?"). There's a spike on 
April Fool's Day. The trend of break-ups on Mondays is highlighted. 
A gentle rise and fall in break-ups is observed over summer holiday. 
Then we see a massive increase leading up to the holidays, but a 
sharp drop-off at Christmas, because clearly breaking up with 
someone then would simply be "Too Cruel." 

Note how a few choice words and phrases make this data so much 
more quickly accessible than it otherwise would be. 

As a side note, in Figure 5.7, the guidance I previously put forth 
about always titling the axes has not been followed. In this case, 
this is by design. Of more interest than the specific metric being 
plotted are the relative peaks and valleys. By not labeling the vertical 
axis (with either title or labels), you simply can't get caught up in a 
debate about it (What is being plotted? How is it being calculated? 
Do I agree with it?). This was a conscious design choice and won't 
be appropriate in most situations but, as we see in this case, 
can—in rare instances—work well. 

In the context of accessibility via text, let's revisit the ticket example 
we examined in Chapters 3 and 4. Figure 5.8 shows where we left 
off after eliminating clutter and drawing attention to where we want 
our audience to focus via data markers and labels. 


(Figure 5.8 is a line graph with an unlabeled y-axis running from 0 to 
300 and the months Jan through Dec along the x-axis.) 

FIGURE 5.8 Let's revisit the ticket example 


Figure 5.8 is a pretty picture, but it doesn't mean much without words 
to help us make sense of it. Figure 5.9 resolves this issue, adding 
the requisite text. 

(Figure 5.9 repeats the graph with the title "Ticket volume over 
time," the x-axis title "2014," and the footnote "Data source: XYZ 
Dashboard, as of 12/31/2014.") 

FIGURE 5.9 Use words to make the graph accessible 

In Figure 5.9, we've added the words that have to be there: graph 
title, axis titles, and a footnote with the data source. In Figure 5.10, 
we take it a step further by adding a call to action and annotation. 

(Figure 5.10 adds the action title "Please approve the hire of 2 FTEs 
to backfill those who quit in the past year" above the graph "Ticket 
volume over time," with the x-axis title "2014." Footnote: Data 
source: XYZ Dashboard, as of 12/31/2014. A detailed analysis on tickets 
processed per person and time to resolve issues was undertaken to 
inform this request and can be provided if needed.) 

FIGURE 5.10 Add action title and annotation 

In Figure 5.10, thoughtful use of text makes the design accessible. 
It's clear to the audience what they are looking at as well as what 
they should pay attention to and why. 


Aesthetics 

When it comes to communicating with data, is it really necessary to 
"make it pretty?" The answer is a resounding Yes. People perceive 
more aesthetic designs as easier to use than less aesthetic designs, 
whether they actually are or not. Studies have shown that more 
aesthetic designs are not only perceived as easier to use, but are also 
more readily accepted and used over time, promote creative thinking and 
problem solving, and foster positive relationships, making people 
more tolerant of problems with designs. 

A great example of the tolerance for problems that good aesthetics 
can foster is a former bottle design of Method liquid dishwashing 
soap, pictured in Figure 5.11. The anthropomorphic form rendered 
the soap an art piece: something to be displayed, not hidden away 
under the counter. This bottle design was wildly effective in spite of 
leakage issues. People were willing to overlook the inconvenience 
of the leaking bottle due to its appealing aesthetics. 



FIGURE 5.11 Method liquid dishwashing soap 

In data visualization, and in communicating with data in general, 
spending time to make our designs aesthetically pleasing can mean 
our audience will have more patience with our visuals, increasing our 
chance of success for getting our message across. 

If you aren't confident in your ability to create aesthetic design, look 
for examples of effective data visualization to follow. When you see a 
graph that looks nice, pause to consider what you like about it. 
Perhaps save it and build a collection of inspiring visuals. Mimic aspects 
from effective designs to create your own. 

More specifically, let's discuss a few things to consider when it comes 
to aesthetic designs of data visualization. We've previously covered 
the main lessons relevant to aesthetics, so I'll touch on them here only 
briefly and then we'll discuss a specific example to see how being 
mindful of aesthetics can improve our data visualization. 

1. Be smart with color. The use of color should always be an 
intentional decision; use color sparingly and strategically to highlight 
the important parts of your visual. 

2. Pay attention to alignment. Organize elements on the page 
to create clean vertical and horizontal lines to establish a sense 
of unity and cohesion. 

3. Leverage white space. Preserve margins; don't stretch your 
graphics to fill the space, or add things simply because you have 
extra space. 

Thoughtful use of color, alignment, and white space are components 
of the design that you don't even notice when they are done well. 
But you notice when they aren't: rainbow colors and a lack of 
alignment and white space make for a visual that's simply uncomfortable 
to look at. It feels disorganized, as if no attention was paid to 
detail. This shows a lack of respect for your data and your audience. 

Let's look at an example: see Figure 5.12. Imagine you work for a 
prominent U.S. retailer. The graph depicts the breakdown of the 
U.S. Population and Our Customers by seven customer segments 
(for example, age ranges). 

[Figure: graph titled "Distribution by customer segment," comparing 
Segments 1 through 7 for US Population and Our Customers] 

FIGURE 5.12 Unaesthetic design 


We can leverage the lessons covered to make smarter design choices. 
Specifically, let's discuss how we can improve Figure 5.12 when it 
comes to the use of color, alignment, and white space. 

Color is overused. There are too many colors, and they compete for 
our attention, making it difficult to focus on one at a time. Going 
back to the lesson on affordances, we should think about what 
we want to highlight to our audience and only use color there. In 
this case, the red box around segments 3 through 5 on the right 
signals that those segments are important, but there are so many 
things competing for our attention that it takes some time to even 
see that. We can make this a more obvious and easier process by 
using color strategically. 

Elements are not properly aligned. The center alignment of the graph 
title makes it so it isn't aligned with anything else in the visual. The 
segment titles at the left aren't aligned to create a clean line either 
on the left or right. This looks sloppy. 



















Finally, white space is misused. There is too much of it between the 
segment titles and data, which makes it challenging to draw your eye 
from the segment title to the data (I have an urge to use my index 
finger to trace across; we can reduce white space between the titles 
and data so that this work is unnecessary). The white space between the 
columns of data is too narrow to optimally emphasize the data and 
is cluttered with unneeded dotted lines. 

Figure 5.13 shows how the same information could look if we 
remedy these design issues. 


Distribution by customer segment 



FIGURE 5.13 Aesthetic design 


Aren't you more likely to spend a little more time with Figure 5.13? 
It is clear that attention to detail was paid to the design: it took the 
designer time to get this result. This creates a sort of onus on the 
part of the audience to spend time to understand it (this sort of 
contract doesn't exist with poor design). Being smart with color, aligning 
objects, and leveraging white space brings a sense of visual 
organization to your design. This attention to aesthetics shows a general 
respect for your work and your audience. 


Acceptance 

For a design to be effective, it must be accepted by its intended 
audience. This adage is true whether the design in question is that 
of a physical object or a data visualization. But what should you do 
when your audience isn't accepting of your design? 

In my workshops, audience members regularly raise this dilemma: 

I want to improve the way we look at things, but when I've tried to 
make changes in the past, my efforts have been met with resistance. 
People are used to seeing things a certain way and don't want us 
to mess with that. 

It is a fact of human nature that most people experience some level 
of discomfort with change. Lidwell et al. in Universal Principles of 
Design (2010) describe this tendency of general audiences to resist 
the new because of their familiarity with the old. Because of this, 
making significant changes to "the way we've always done it" may 
require more work to gain acceptance than simply replacing the old 
with the new. 

There are a few strategies you can leverage for gaining acceptance 
in the design of your data visualization: 

• Articulate the benefits of the new or different approach. 
Sometimes simply giving people transparency into why things will look 
different going forward can help them feel more comfortable. Are 
there new or improved observations you can make by looking at 
the data in a different way? Or other benefits you can articulate to 
help convince your audience to be open to the change? 

• Show the side-by-side. If the new approach is clearly superior to 
the old, showing them side-by-side will demonstrate this. Couple 
this with the prior approach by showing the before-and-after and 
explaining why you want to shift the way you're looking at things. 

• Provide multiple options and seek input. Rather than prescribing 
the design, consider creating several options and getting feedback 
from colleagues or your audience (if appropriate) to determine 
which design will best meet the given needs. 

• Get a vocal member of your audience on board. Identify 
influential members of your audience and talk to them one-on-one in 
an effort to gain acceptance of your design. Ask for their feedback 
and incorporate it. If you can get one or a couple of vocal 
members of your audience bought in, others may follow. 

One thing to consider if you find yourself met with resistance is whether 
the root problem is that your audience is slow to change or if there 
might be issues with the design you are proposing. Test this by 
getting input from someone who doesn't have a vested interest. Show 
them your data visualization. If appropriate, also show the historical 
or current visuals. Have them talk you through their thought process 
as they review the visual. What do they like? What questions do they 
have? Which visual do they prefer and why? Hearing these things from 
a nonbiased third party may help you uncover issues with your design 
that are leading to the adoption challenge you face with your 
audience. The conversation may also help you articulate talking points 
that will help you drive the acceptance you seek from your audience. 


In closing 

By understanding and employing some traditional design concepts, 
we set ourselves up for success in communicating with data. Offer 
your audience visual affordances as cues for how to interact with 
your communication: highlight the important stuff, eliminate 
distractions, and create a visual hierarchy of information. Make your designs 
accessible by not overcomplicating and by leveraging text to label 
and explain. Increase your audience's tolerance of design issues by 
making your visuals aesthetically pleasing. Employ the strategies 
discussed for gaining audience acceptance for your visual designs. 

Congratulations! You now know the 5th lesson in storytelling with 
data: how to think like a designer. 


chapter six 


dissecting model visuals 


Up to this point, we've covered a number of lessons you can employ 
to improve your ability to communicate with data. Now that you 
understand the basics of what makes a visual effective, let's consider 
some additional examples of what "good" data visualization looks 
like. Before covering our final lesson, in this chapter we will look at 
several model visuals and discuss the thought process and design 
choices that led to their creation, utilizing the lessons we've covered. 

You'll notice some similar considerations being made across the 
various examples. When creating each example, I thought about how I 
want the audience to process the information and made 
corresponding choices regarding what to emphasize and draw the audience's 
attention to as well as what to de-emphasize. Because of this, you 
will see common points raised around color and size. The choice of 
visual, relative ordering of data, alignment and positioning of 
elements, and use of words are also discussed in a number of cases. 
This repetition is useful to reinforce the concepts I'm thinking about 
and resulting design choices across the various examples. 

Each visual highlighted was created to meet the need of a specific 
situation. I'll discuss the relevant scenarios briefly, but don't worry 
too much about the details. Rather, spend time looking at and 
thinking about each model visual. Consider what data visualization 
challenges you face where the given approach (or aspects of the given 
approach) could be leveraged. 

Model visual #1: line graph 
Annual giving campaign progress 



FIGURE 6.1 Line graph 

Company X runs an annual month-long "giving campaign" to raise 
money for charitable causes. Figure 6.1 shows this year's progress to 
date. Let's consider what makes this example good and the 
deliberate choices made in the course of its creation. 

Words are used appropriately. Everything is titled and labeled, so 
there's no question about what we are looking at. Graph title, vertical 
axis title, and horizontal axis title are present. The various lines in the 
graph are labeled directly, so there's no work going back and forth 
between a legend and the data to decipher what is being graphed. 
Good use of text makes this visual accessible. 

If we apply the "where are your eyes drawn?" test described in 
Chapter 4, I briefly scan the graph title, then I'm drawn to the "Progress to 
date" trend (where we want the audience to focus). I almost always 
use dark grey for the graph title. This ensures that it stands out, but 
without the sharp contrast you get from pure black on white (rather, 
I preserve the use of black for a standout color when I'm not using 
any other colors). A number of preattentive attributes are employed 
to draw attention to the "Progress to date" trend: color, thickness of 
line, presence of data marker and label on the final point, and the 
size of the corresponding text. 

When it comes to the broader context, a couple of points for 
comparison are included but de-emphasized so the graph doesn't become 
visually overwhelming. The goal of $50,000 is drawn on the graph for 
reference, but is pushed to the background by use of a thin line; both 
the line and text are the same grey as the rest of the graph details. Last 
year's giving over time is included but also de-emphasized through the 
use of a thinner line and lighter blue (to tie it visually with this year's 
progress, but without competing for attention). 

A couple of deliberate decisions were made regarding axis labels. 
On the vertical y-axis, you could consider rounding the numbers to 
thousands—so the axis would range from $0 to $60 and the axis title 
would be changed to "Money raised (thousands of dollars)." If the 
numbers were on the scale of millions, I probably would have done 
this. For me, however, thinking about numbers in the thousands isn't 
as intuitive, so rather than mess with the scale here, I preserved the 
zeros in the y-axis labels. 
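
To make the trade-off concrete, here is a minimal Python sketch of the two labeling options for the vertical axis. The function name dollar_labels, the $10,000 tick spacing, and the $0 to $60,000 range are illustrative assumptions rather than details taken from the original figure.

```python
def dollar_labels(values, in_thousands=False):
    """Format whole-dollar tick values as axis labels.

    With in_thousands=True the trailing zeros are dropped (60000 -> "$60"),
    so the axis title would need to say "thousands of dollars".
    """
    if in_thousands:
        return [f"${value // 1000}" for value in values]
    return [f"${value:,}" for value in values]


# Hypothetical tick positions: every $10,000 from $0 to $60,000.
ticks = list(range(0, 60001, 10000))

full_labels = dollar_labels(ticks)
short_labels = dollar_labels(ticks, in_thousands=True)

print(full_labels)   # ['$0', '$10,000', '$20,000', '$30,000', '$40,000', '$50,000', '$60,000']
print(short_labels)  # ['$0', '$10', '$20', '$30', '$40', '$50', '$60']
```

Either list could be handed to a plotting tool as tick labels; the choice between them is the judgment call described above.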

On the horizontal x-axis, we don't need every single day labeled since 
we're more interested in the overall trend, not what happened on a 
specific day. Because we have data through the 10th day of a 30-day 
month, I chose to label every 5th day on the x-axis (given that this is 
days we're talking about, another potential solution would be to label 
every 7th day and/or add super-categories of week 1, week 2, etc.). 
This is one of those cases where there isn't a single right answer: you 
should think about the context, the data, and how you want your 
audience to use the visual and make a deliberate decision in light 
of those things. 
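
The x-axis options can be sketched in a few lines of Python. This is only an illustration: the helper names tick_labels and week_of are hypothetical, and whether day 1 also gets a label is a detail the text does not specify (this sketch leaves it blank).

```python
def tick_labels(num_days, step):
    """Return one label per day: the day number on every step-th day, else ''."""
    return [str(day) if day % step == 0 else "" for day in range(1, num_days + 1)]


def week_of(day):
    """Map a 1-based day of the month to a 'week N' super-category."""
    return f"week {(day - 1) // 7 + 1}"


every_fifth = tick_labels(30, 5)
labeled_days = [label for label in every_fifth if label]

print(labeled_days)  # ['5', '10', '15', '20', '25', '30']
print(week_of(10))   # week 2
```

Calling tick_labels(30, 7) instead gives the every-7th-day alternative, and week_of supplies the week super-categories.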


Model visual #2: annotated line graph with forecast 


[Figure: line graph titled "Sales over time," with annotated text boxes 
for 2011-14 and for 2015 & beyond, ACTUAL and FORECAST labeled under 
the x-axis, and numeric data labels of $108, $119, $131, and $144] 

Data source: Sales Dashboard; annual figures are as of 12/31 of the given year. 

*Use this footnote to explain what is driving the 10% annual growth forecast assumption. 

FIGURE 6.2 Annotated line graph with forecast 

Figure 6.2 shows an annotated line graph of actual and forecast 
annual sales. 

I often see forecast and actual data plotted together as a single line, 
without any distinguishing aspects to set the forecast numbers apart 
from the rest. This is a mistake. We can leverage visual cues to draw 
a distinction between the actual and forecast data, easing the 
interpretation of the information. In Figure 6.2, the solid line represents 
actual data and a thinner dotted line (which carries some connotation 
of less certainty than a solid, bold line) represents the forecast data. 
Clear labeling of Actual and Forecast under the x-axis helps 
reinforce this (written in all caps for easy scanning), with the forecast 
portion set apart visually ever so slightly via light background shading. 

In this visual, everything has been pushed to the background through 
the use of grey font and elements except the graph title, dates within 
the text boxes, data (line), select data markers, and numeric data 
labels from 2014 forward. When we consider the visual hierarchy of 
elements, my eye goes first to the graph title at the top left (due to 
both position and the preattentive larger dark grey text discussed in 
the prior example), then to the blue dates in the text boxes, at which 
point I can pause and read for a little context before moving my eye 
downward to see the corresponding point or trend in the data. Data 
markers are included only for those points referenced in the 
annotation, making it a quick process to see what part of the data is 
relevant to which annotation. (Originally, the data markers were solid 
blue, but I changed to white with blue outline, which made them 
stand out a little more in a way that I liked; the forecast data 
markers are smaller and solid blue, because white with blue outline there 
looked overly cluttered against the dotted lines.) 

The $108 numeric label is bold. This is emphasized intentionally, since 
it is the last point of actual data and the anchor for the forecast. 
Historical data points are not labeled. Instead, the y-axis is preserved 
to give a general sense of magnitude, since we want the audience 
to focus on relative trends rather than precise values. Numeric data 
labels are included for the forecast data points to give the audience 
a clear understanding of forward-looking expectations. 

All text in the visual is the same size except where intentional 
decisions were made to change it. The graph title is larger. The footnote 
is de-emphasized via smaller font and a low-priority placement at 
the bottom of the visual so that it is there to aid interpretation as 
needed, but doesn't draw attention. 
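
The forecast labels shown in Figure 6.2 ($119, $131, $144) are consistent with compounding the last actual value of $108 at the 10% annual rate mentioned in the footnote. The Python sketch below reproduces that arithmetic and pairs actual and forecast segments with different line styles. The function names and the style dictionaries are illustrative assumptions; the keys are only chosen to resemble common plotting keyword arguments.

```python
def forecast(last_actual, growth_rate, periods):
    """Compound last_actual forward, rounding each value to a whole number."""
    return [round(last_actual * (1 + growth_rate) ** n)
            for n in range(1, periods + 1)]


def line_style(is_forecast):
    """Solid, heavier line for actual data; thinner dotted line for forecast."""
    if is_forecast:
        return {"linestyle": "dotted", "linewidth": 1.5}
    return {"linestyle": "solid", "linewidth": 3.0}


# $108 is the last actual value; 10% is the growth assumption in the footnote.
forecast_values = forecast(108, 0.10, 3)
print(forecast_values)  # [119, 131, 144]

actual_style = line_style(is_forecast=False)
forecast_style = line_style(is_forecast=True)
```

Keeping the style decision in one small function makes it harder to accidentally draw the forecast with the same weight as the actual data.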

Model visual #3: 100% stacked bars 


Goal attainment over time 


■ Miss ■ Meet ■ Exceed 


As of Q3 2015, more than 1/3 of 
projects are missing goals 



Data source: XYZ Dashboard; the total number of projects has increased over time from 230 in early 2013 to nearly 270 in Q3 2015. 

FIGURE 6.3 100% stacked bars 


The stacked bar chart in Figure 6.3 is an example visual from the 
consulting world. Each consulting project has specific goals 
associated with it. Progress against those goals is assessed quarterly and 
designated as "Miss," "Meet," or "Exceed." The stacked bar chart 
shows the percentage of total projects in each of these categories 
over time. As with prior examples, don't worry too much about the 
details here; instead, reflect on what can be learned from the design 
considerations that went into creating this data visualization. 

Let's first consider the alignment of objects within this visual. The 
graph title, legend, and vertical y-axis title are all aligned in the 
upper-left-most position. This means our audience encounters how 
to read the graph before they get to the data. On the left-hand side, 
the graph title, legend, y-axis title, and footnote are all aligned, 
creating a clean line on the left side of the visual. On the right-hand 
side, the text at the top is right-justified and aligned with the final 
bar of data that contains the data point being described 
(leveraging the Gestalt principle of proximity). This same text box is aligned 
vertically with the graph legend. 

When it comes to focusing the audience's attention, red is used as 
the single attention-grabbing color (primary red tends to be too 
loud for me, so I often opt instead for a burnt-red shade as I did 
here). Everything else is grey. Numeric data labels were used on the 
points where we want the audience to focus, the increasing percentage 
of projects missing goals; they are an additional visual cue signaling 
importance, given the stark contrast of white on red and the large 
text. The rest of the data is preserved for context, but pushed to the 
background so it doesn't compete for attention. Slightly different shades 
of grey were used so you can still focus on one or the other series 
of data at a time, but they don't distract from the clear emphasis on 
the red series. 

The categories fall along a scale from "Miss" to "Exceed," and this 
ordering is leveraged from bottom to top within the stacked bars. 
The "Miss" category is closest to the x-axis, making change over 
time easy to see because of the alignment of the bars at the same 
starting point (the x-axis). Change over time in the "Exceed" 
category is also easy to see because of the consistent alignment along the top 
of the graph. The change over time in the percentage of projects 
that meet their goals is harder to see because there is no consistent 
baseline at either the top or bottom of the graph, but given that this 
is a lower-priority comparison, this is OK. 


Words make the visual accessible. The graph has a title, the y-axis 
has a title, and the x-axis leverages super-categories (years) to reduce 
redundant labeling and make the data more easily scannable. The 
words at the top right reinforce what we should be paying attention 
to (we will talk about words much more in the context of storytelling 
in Chapter 7). The footnote contains a note about the total number 
of projects over time, which is helpful context that we don't get from 
the visual directly due to the use of 100% stacked bars. 
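
The mechanics of a 100% stacked bar can be sketched as follows: convert raw counts to shares of the total, then stack them in a fixed order so that "Miss" sits on the x-axis and "Exceed" ends at the top. The project counts below are made-up numbers for illustration only, and the function name to_percent_stack is hypothetical.

```python
# Stacking order from the x-axis upward, as described in the text.
ORDER = ["Miss", "Meet", "Exceed"]


def to_percent_stack(counts):
    """Return (category, bottom, height) tuples that sum to 100 for one bar."""
    total = sum(counts[category] for category in ORDER)
    segments = []
    bottom = 0.0
    for category in ORDER:
        share = 100.0 * counts[category] / total
        segments.append((category, bottom, share))
        bottom += share
    return segments


# Hypothetical counts for a single quarter (not from the original data).
quarter_counts = {"Miss": 50, "Meet": 100, "Exceed": 50}
segments = to_percent_stack(quarter_counts)

for category, bottom, height in segments:
    print(f"{category}: starts at {bottom:.0f}%, height {height:.0f}%")
```

Because every bar is rescaled to 100%, the total count (here 200) disappears from the visual, which is why the footnote about the number of projects is useful.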


Model visual #4: leveraging positive and negative stacked bars 


[Figure: stacked bar chart titled "Expected director population over 
time," with a y-axis (# of directors) running from -100 to 250 and 
an x-axis starting at Today (9/30/15). Series, from top to bottom: 
Unmet need (gap), Directors from acquisitions, Promotions to 
director, Today's directors, Attrition] 

A footnote explaining relevant forecast assumptions and methodology would go here. 

FIGURE 6.4 Leveraging positive and negative stacked bars 

Figure 6.4 shows an example from the people analytics space. It can 
be useful to look forward to understand expected needs for senior 
talent and identify any gaps so they can be proactively addressed. 
In this example, there will be increasing unmet need for directors 
given assumptions for expected additions to the director pool over 
time through acquisitions and promotions and the decrease to the 
pool over time due to attrition (directors leaving the company). 

If we consider the path our eyes take with Figure 6.4, mine scan the 
title, then go directly to the big, bold, black numbers and follow them 
to the right to the text that tells me this represents "Unmet need 
(gap)." My eye then goes downward, reading the text and glancing 
back leftward to the data each describes, until I hit the final series, 
"Attrition," at the bottom. At this point, my eyes sort of bounce back 
and forth between "Attrition" and "Unmet need (gap)" portions of 
the bars, noting that there is some increase in the total number of 
directors over time as we look left to right (likely as the overall 
company grows and the need for senior leaders increases as a result), 
but that the majority of the unmet need is due to attrition of the 
current director pool. 

Intentional choices were made when it comes to the use of color 
throughout this visual. "Today's directors" are shown in my 
standard medium blue. The exiting directors ("Attrition") are shown in 
a less saturated version of the same color to tie these together 
visually. Over time, you see less of the blue falling above the axis and an 
increasing proportion falling below the axis as more and more 
directors attrite. The negative direction of the "Attrition" series reinforces 
that this volume represents a decrease to the director pool. 
Directors added through acquisitions and promotions are shown in green 
(which carries positive connotation). The unmet need is depicted by 
an outline only, to visually show empty space, reinforcing that this 
represents a gap. The text labels on the right are each written in the 
same color as the given data series they describe, except "Unmet 
need (gap)," which is written in the same big, bold, black text as the 
data labels for this series. 

The ordering of the various data series within the stacked bars is 
deliberate. "Today's directors" is the base, and as such is shown 
beginning at the horizontal axis. As I mentioned previously, the 
negative "Attrition" series falls below that in a negative direction. 
Above "Today's directors" are the additions: promotions and 
acquisitions. Finally, at the top (where our eye hits sooner than the 
subsequent data), we encounter the "Unmet need (gap)." 

The y-axis is preserved so the reader has a sense of total magnitude 
(both in the positive and negative direction), but it is pushed to the 
background via grey text. Only those specific points we should pay 
attention to—the "Unmet need (gap)"—are labeled directly with 
numerical values. 

All text in the visual is the same size except where decisions were 
made to further emphasize or de-emphasize components. The graph 
title is larger. The axis title "# of directors" is slightly larger to ease 
the reading of the rotated text. The "Unmet need (gap)" text and 
numbers are bigger and bolder than anything else in the visual, as 
this is where we want the reader to pay attention. The footnote is 
written in smaller text, so it is there as needed but does not draw 
attention. By making it grey and in the lowest-priority position at the 
bottom of the visual, we further de-emphasize the footnote. 
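
The stacking logic behind a chart like Figure 6.4 can be sketched in Python. This is a rough illustration under stated assumptions: the function name stack_period and all input numbers are hypothetical, and the gap is assumed to be the expected need minus the sum of remaining directors, promotions, and acquisitions.

```python
def stack_period(todays_directors, attrition, promotions, acquisitions, need):
    """Return (series, bottom, height) tuples for one bar.

    Attrition is drawn below the axis as a negative height; everything else
    stacks upward from zero, ending with the unmet need on top.
    """
    remaining = todays_directors - attrition
    supply = remaining + promotions + acquisitions
    gap = max(need - supply, 0)
    return [
        ("Attrition", 0, -attrition),
        ("Today's directors", 0, remaining),
        ("Promotions to director", remaining, promotions),
        ("Directors from acquisitions", remaining + promotions, acquisitions),
        ("Unmet need (gap)", supply, gap),
    ]


# Hypothetical inputs for a single future period (not from the original data).
bar = stack_period(todays_directors=190, attrition=40,
                   promotions=10, acquisitions=5, need=200)

for series, bottom, height in bar:
    print(f"{series}: bottom={bottom}, height={height}")
```

With these inputs the gap comes out to 35: 150 remaining directors plus 15 additions leaves 165 against a need of 200.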


Model visual #5: horizontal stacked bars 

Top 15 development priorities, according to survey 

Bar segments: Most important | 2nd Most Important | 3rd Most Important 

PRIORITY                          TOTAL %   Most important   2nd Most Important 
Education                         45%       24%              14% 
Agriculture & rural development   37%       17%              12% 
Poverty reduction                 32%       15%              10% 
Reconstruction                    18%       9%               5% 
Economic growth                   17%       7%               5% 
Health                            16% 
Job creation                      15% 
Governance                        14% 
Anti-corruption                   14% 
Transport                         12% 
Energy                            11% 
Law & Justice                     9% 
Basic infrastructure              8% 
Public sector reform              8% 
Public financial management       7% 

N = 4,392. Based on responses to item, When considering development priorities, which one development priority is the most important? Which one is 
the second most important priority? Which one is the third most important priority? Respondents chose from a list. Top 15 shown. 

FIGURE 6.5 Horizontal stacked bars 


Figure 6.5 shows the results of survey questions on relative 
priorities in a developing nation. This is a great deal of information, but 
due to strategic emphasizing and de-emphasizing of components, 
it does not become visually overwhelming. 

Stacked bars make sense here given the nature of what is being 
graphed: top priority (in first position in the darkest shade), 2nd 
priority (in second position and a slightly lighter shade of the same 
color), and 3rd priority (in third position and an even lighter shade of 
the same color). Orienting the chart horizontally means the category 
names along the left are easy to read in horizontal text. 

The categories are organized vertically in descending order of "Total 
%," giving the audience a clear construct to use as they interpret the 
data. The biggest categories are at the top, so we see them first. 
The top three priorities are emphasized specifically through the use 
of color (the narrative that accompanied the original version of this 
visual focused on these). This color is leveraged for the category 
name, total %, and stacked bars of data. This consistent color ties 
the components together visually. 

One decision point when graphing data is whether to preserve the 
axis, label the data points (or some data points) directly, or both. In 
this case, the numeric data labels within the bars have been 
preserved, but de-emphasized with smaller text (oriented to the left, 
which creates a clean line as you scan down the data labels for the 
"Most important," making it feel slightly less cluttered than right- or 
center-oriented text that would vary in position across each of the 
bars). The data labels were further de-emphasized through the color 
they are written in: a light shade of blue or grey that doesn't create 
as stark a contrast as white labels on a colored bar. The x-axis was 
eliminated altogether. Here, we implicitly assume that the specific 
values are important enough to label. Another scenario may call for 
a different approach. 

As we noted with a number of the previous examples, words are used 
well in this visual. Everything is titled and labeled. The titles 
"Priority" and "Total %" are written all in caps for easy scanning. The 
legend for the interpretation of the bars appears immediately above 
the first bar of data with the keywords "Most," "2nd," and "3rd" 
bolded for emphasis. Additional detail is described in the footnote. 
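
Two of the design choices in this example, ordering categories by "Total %" and reserving color for the top three, are easy to express in code. The sketch below uses a handful of the totals shown in Figure 6.5; the function name rank_and_color and the color names "blue" and "grey" are assumptions made for illustration.

```python
def rank_and_color(totals, top_n=3, accent="blue", muted="grey"):
    """Sort categories by total (descending) and color only the top_n."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, pct, accent if position < top_n else muted)
            for position, (name, pct) in enumerate(ranked)]


# A subset of the Total % values from Figure 6.5, deliberately out of order.
totals = {
    "Health": 16,
    "Education": 45,
    "Reconstruction": 18,
    "Poverty reduction": 32,
    "Agriculture & rural development": 37,
    "Economic growth": 17,
}

ranked = rank_and_color(totals)
for name, pct, color in ranked:
    print(f"{name:<32}{pct:>3}%  {color}")
```

The same color value can then be reused for the category name, the total, and the bar segments so that the three components are tied together visually.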


In closing 

We can learn by examining effective visual displays and 
considering the design choices that were made to create them. Through 
the examples in this chapter, we've reinforced a number of the 
lessons covered up to this point. We touched on the choice of graph 
type and ordering of data. We considered where our eyes are drawn 
and in what order due to strategies employed to emphasize and 
de-emphasize components through the use of color, thickness, and 
size. We discussed the alignment and positioning of elements. We 
considered the appropriate use of text that makes the visuals 
accessible through clear titling, labeling, and annotation. 

There is something to be learned from every example of data 
visualization you encounter, both good and bad. When you see 
something you like, pause to consider why. Those who follow my blog 
(storytellingwithdata.com) might be aware that I'm also an avid cook, 
and I often parlay the following food metaphor into data analytics: 
in data visualization, there is rarely (if ever) a single "right" answer; 
rather, there are flavors of good. The examples we've looked at in 
this chapter are the haute cuisine of charts. 

That said, different people will make different decisions when faced 
with the same data visualization challenge. Because of this, 
inevitably I've made some design choices in these visuals that you might 
have handled differently. That's OK. I hope by articulating my thought 
process that you can understand why I made the design choices I 
did. These are considerations to keep in mind in your own design 
process. Of primary importance is that your design choices be just 
that: intentional. 

Now you're ready for the final storytelling with data lesson: tell a 
story. 



chapter seven 


lessons in storytelling 


In my workshops, the lesson on storytelling often begins with a 
thought exercise. I ask participants to close their eyes and recall 
the story of Red Riding Hood, considering specifically the plot, the 
twists, and the ending. This exercise sometimes generates some 
laughs; people wonder about its relevance or gamely confuse it with 
Three Little Pigs. But I find that the majority of participants (typically 
around 80-90% based on a show of hands) are able to remember 
the high-level story, often a modified version of Grimms' 
macabre original. 

Indulge me for a moment, while I tell you the version that resides 
in my head: 

Grandma has fallen ill and Red Riding Hood sets out on a walk 
through the woods with a basket of goodies to deliver to her. On her 
way, she encounters a woodsman and a wolf. The wolf runs ahead, 
eats Grandma, and dresses up in her clothes. When Red arrives, she 
senses something is awry. She goes through a series of questions 
with the wolf (posing as Grandma), culminating in the observation: 
"Oh, Grandma, how big your teeth are!" To this the wolf replies, 
"The better to eat you with!" and swallows Red whole. The 
woodsman walks by, and, seeing the door to Grandma's house ajar, decides 
to investigate. Inside, he finds the wolf dozing after his meal. The 
woodsman suspects what has happened and chops the wolf in half. 
Grandma and Red Riding Hood emerge, safe and sound! It is a 
happy ending for everyone (except the wolf). 

Now let's turn back to the question that may be on the tip of your 
tongue: What could Red Riding Hood possibly have to do with
communicating with data?

For me, this exercise is evidence of a couple of things. First is the power
of repetition. You likely have heard some version of Red Riding Hood
a number of times. Perhaps you've read or told a version of the story
a number of times. This process of hearing, reading, and saying things
numerous times helps to cement them in our long-term memory. Second,
stories like Red Riding Hood employ this magical combination of
plot-twists-ending (or, as we'll learn momentarily from Aristotle—beginning,
middle, and end), which works to embed things in our memory
in a way that we can later recall and retell the story to someone else. 

In this chapter, we explore the magic of story and how we can use 
concepts of storytelling to communicate effectively with data. 


The magic of story 

When you see a great play, watch a captivating movie, or read a
fantastic book, you've experienced the magic of story. A good story
grabs your attention and takes you on a journey, evoking an
emotional response. In the middle of it, you find yourself not wanting to
turn away or put it down. After finishing it—a day, a week, or even a 
month later—you could easily describe it to a friend. 

Wouldn't it be great if we could ignite such energy and emotion in our 
audiences? Story is a time-tested structure; humans have been
communicating with stories throughout history. We can leverage this powerful
tool for our business communications. Let's look to the art forms of
plays, movies, and books to understand what we can learn from
master storytellers that will help us better tell our own stories with data.

Storytelling in plays 

The notion of narrative structure was first described in ancient times 
by Greek philosophers such as Aristotle and Plato. Aristotle
introduced a basic but profound idea: that story has a clear beginning,
middle, and end. He proposed a three-act structure for plays. This 
concept has been refined over time and is commonly referred to as 
the setup, conflict, and resolution. Let's look briefly at each of these 
acts and what they contain, and then we'll consider what we can 
learn from this approach. 

The first act sets up the story. It introduces the main character, or
protagonist, their relationships, and the world in which they live. After
this setup, the main character is confronted with an incident. The
attempt to deal with this incident typically leads to a more dramatic
situation. This is known as the first turning point. The first turning
point ensures that life will never be the same for the main character
and raises the dramatic question—framed in terms of the main
character's call to action—to be answered in the climax of the play. 
This marks the end of the first act. 

The second act makes up the bulk of the story. It depicts the main 
character's attempt to resolve the problem created through the first 
turning point. Often, the main character lacks the skills to deal with 
the problem he faces, and, as a result, finds himself encountering 
increasingly worsening situations. This is known as the character arc, 
where the main character goes through major changes in his life as a 
result of what is happening. He may have to learn new skills or reach 
a higher sense of awareness of who he is and what he is capable of 
in order to deal with his situation. 

The third act resolves the story and its subplots. It includes a climax, 
where the tensions of the story reach the highest point of intensity. 
Finally, the dramatic question introduced in the first act is answered, 
leaving the protagonist and other characters with a new sense of
who they really are. 

There are a couple of lessons to be learned here. First, the three-act 
structure can serve as a model for us when it comes to communicating
in general. Second, conflict and tension are an integral part
of story. We'll come back to these ideas shortly and explore some 
concrete applications. In the meantime, let's see what we can learn 
from an expert storyteller from the movies. 


Storytelling and the cinema 

Robert McKee is an award-winning writer and director and a
well-respected screenwriting lecturer (his former students include 63
Academy Award and 164 Emmy Award winners, and his book, Story, 
is required reading in many university cinema and film programs). In 
an interview for Harvard Business Review, he discusses persuasion 
through storytelling and examines how storytelling can be leveraged
in a business setting. McKee says there are two ways to
persuade people:

The first is conventional rhetoric. In the business world, this typically 
takes the form of PowerPoint slides filled with bulleted facts and
statistics. It's an intellectual process. But it is problematic, because while
you're trying to persuade your audience, they are arguing with you 
in their heads. McKee says, "If you do succeed in persuading them, 
you've only done so on an intellectual basis. That's not good enough, 
because people are not inspired to act by reason alone" (Fryer, 2003). 

Think about what Red Riding Hood would look like if we reduced 
the story to conventional rhetoric. Libby Spears does an amusing 
version of this in her slide deck, Little Red Riding Hood and the Day 
PowerPoint Came to Town. Here is my take on it—bullets on a
PowerPoint slide might look something like the following:

• Red Riding Hood (RRH) has to walk 0.54 mi from Point A (home) 
to Point B (Grandma's) 
• RRH meets Wolf, who (1) runs ahead to Grandma's, (2) eats her,
and (3) dresses in her clothes 

• RRH arrives at Grandma's at 2pm, asks her three questions 

• Identified problem: after third question, Wolf eats RRH 

• Solution: vendor (Woodsman) employs tool (ax) 

• Expected outcome: Grandma and RRH alive, wolf is not 

When reduced to the facts, it's not so interesting, is it? 

The second way to persuade, according to McKee, is through story. 
Stories unite an idea with an emotion, arousing the audience's
attention and energy. Because it requires creativity, telling a compelling
story is harder than conventional rhetoric. But delving into your
creative recesses is worth it because story allows you to engage your
audience on an entirely new level. 

What exactly is story? At a fundamental level, a story expresses
how and why life changes. Stories start with balance. Then
something happens—an event that throws things out of balance. McKee
describes this as "subjective expectation meets cruel reality." This is 
that same tension we discussed in the context of plays. The resulting 
struggle, conflict, and suspense are critical components of the story. 

McKee goes on to say that stories can be revealed by asking a few 
key questions: What does my protagonist want in order to restore 
balance in his or her life? What is the core need? What is keeping 
my protagonist from achieving his or her desire? How would my 
protagonist decide to act in order to achieve his or her desire in the 
face of those antagonistic forces? After creating the story, McKee 
suggests leaning back to consider: Do I believe this? Is it neither an 
exaggeration nor a soft-soaping of the struggle? Is this an honest 
telling, though heaven may fall? 

What can we learn from McKee? The meta-lesson is that we can 
use stories to engage our audience emotionally in a way that goes 
beyond what facts can do. More specifically, we can use the questions 
he outlines to identify stories to frame our communications. We'll 
consider this further soon. First, let's see what we can learn about
storytelling from a master storyteller when it comes to the written word.


Storytelling and the written word 

When asked about writing a captivating story by International Paper, 
Kurt Vonnegut (author of novels such as Slaughterhouse-Five and 
Breakfast of Champions) outlined the following tips, which I've
excerpted from his short article, "How to Write with Style" (a great 
quick read): 

1. Find a subject you care about. It is this genuine caring, and not
your games with language, which will be the most compelling 
and seductive element in your style. 

2. Do not ramble, though. 

3. Keep it simple. Great masters wrote sentences which were 
almost childlike when their subjects were most profound. "To 
be or not to be?" asks Shakespeare's Hamlet. The longest word 
is three letters. 

4. Have the guts to cut. If a sentence, no matter how excellent, 
does not illuminate your subject in some new and useful way, 
scratch it out. 

5. Sound like yourself. I myself find that I trust my own writing 
most, and others seem to trust it most, too, when I sound most 
like a person from Indianapolis, which is what I am. 

6. Say what you meant to say. If I broke all the rules of punctuation,
had words mean whatever I wanted them to mean, and
strung them together higgledy-piggledy, I would simply not be 
understood. 

7. Pity the readers. Our audience requires us to be sympathetic 
and patient teachers, ever willing to simplify and clarify. 

This advice contains a number of gems that we can apply in the 
context of storytelling. Keep it simple. Edit ruthlessly. Be authentic. 
Don't communicate for yourself—communicate for your audience.
The story is not for you; the story is for them. 

Now that we've learned some lessons from the masters, let's
consider how we can construct our stories.


Constructing the story 

We introduced the foundations of a narrative in Chapter 1 with the 
Big Idea, 3-minute story, and storyboarding to outline content to 
include while starting to consider order and flow. We learned how 
important it is to identify our audience—both who they are and what 
we need them to do. In the interim, we also learned how to perfect 
the data visualizations that we'll include in our communication. Now
that we're set on that front, it is time to turn back to the story. Story
is what ties together information, giving our presentation or
communication a framework for our audience to follow.

Perhaps Vonnegut appreciated Aristotle's simple yet profound
observation that a story has a clear beginning, middle, and end. For a
concrete example, think back to what we considered with Red Riding
Hood: the magical combination of plot, twists, and ending. We
can use this idea of beginning, middle, and end—taking inspiration 
from the three-act structure—to set up the stories that we want to 
communicate with data. Let's discuss each of these pieces and the 
specifics to consider when crafting your story. 


The beginning 

The first thing to do is introduce the plot, building the context for 
your audience. Consider this the first act. In this section, we set up 
the essential elements of story—the setting, main character,
unresolved state of affairs, and desired outcome—getting everyone on
common ground so the story can proceed. We should involve our 
audience, piquing their interest and answering the questions that are 
likely on their mind: Why should I pay attention? What is in it for me? 
In his book, Beyond Bullet Points, Cliff Atkinson outlines the
following questions to consider and address when it comes to setting up
the story: 

1. The setting: When and where does the story take place? 

2. The main character: Who is driving the action? (This should be 
framed in terms of your audience!) 

3. The imbalance: Why is it necessary, what has changed? 

4. The balance: What do you want to see happen? 

5. The solution: How will you bring about the changes? 

Note the similarity between the questions above and those raised 
by McKee that we covered earlier. 


Using PowerPoint to tell stories 


Cliff Atkinson uses PowerPoint to tell stories, leveraging
the basic architecture of the three-act structure. His
book, Beyond Bullet Points, introduces a story template and
offers practical advice using PowerPoint to help users
create stories with their presentations. More on this and related
resources can be found at beyondbulletpoints.com.


Another way to think about the imbalance-balance-solution in your 
communication is to frame it in terms of the problem and your
recommended solution. If you find yourself thinking, But I don't have
a problem!—you may want to reconsider. As we've discussed,
conflict and dramatic tension are critical components of a story. A story
where everything is rosy and is expected to continue to be is not so 
interesting, attention-grabbing, or action-inspiring. Think of conflict 
and tension—between the imbalance and balance, or in terms of the 
problem on which you are focusing—as the storytelling tools that 
will help you to engage your audience. Frame your story in terms 
of their (your audience's) problem so that they immediately have a
stake in the solution. Nancy Duarte calls this tension "the conflict 
between what is and what could be." There is always a story to tell. 
If it's worth communicating, it's worth spending the time necessary 
to frame your data in a story. 


The middle 

Once you've set the stage, so to speak, the bulk of your
communication further develops "what could be," with the goal of
convincing your audience of the need for action. You retain your audience's
attention through this part of the story by addressing how they can
solve the problem you introduced. You'll work to convince them why
they should accept the solution you are proposing or act in the way 
you want them to. 

The specific content will take different forms depending on your
situation. The following are some ideas for content that might make
sense to include as you build out your story and convince your
audience to buy in:

• Further develop the situation or problem by covering relevant 
background. 

• Incorporate external context or comparison points. 

• Give examples that illustrate the issue. 

• Include data that demonstrates the problem. 

• Articulate what will happen if no action is taken or no change is 
made. 

• Discuss potential options for addressing the problem. 

• Illustrate the benefits of your recommended solution. 

• Make it clear to your audience why they are in a unique position 
to make a decision or drive action. 

When considering what to include in your communication, keep your
audience top of mind. Think about what will resonate with them and 
motivate them. For example, will your audience be motivated to act 
by making money, beating the competition, gaining market share,
saving a resource, eliminating excess, innovating, learning a skill, or 
something else? If you can identify what motivates your audience, 
consider framing your story and the need for action in terms of this. 
Also think about whether and when data will strengthen your story 
and integrate it as makes sense. Throughout your communication, 
make the information specific and relevant to your audience. The 
story should ultimately be about your audience, not about you. 


Write the headlines first 


When it comes to structuring the flow of your overall
presentation or communication, one strategy is to create
the headlines first. Think back to the storyboarding that we 
discussed in Chapter 1. Write each headline on a Post-it note. 
Play with the order to create a clear flow, connecting each 
idea to the next in a logical fashion. Establishing this sort of 
structure helps ensure that there is a logical order for your 
audience to follow. Make each headline the title of your
presentation slides or the section title in a written report.


The end 

Finally, the story must have an end. End with a call to action: make 
it totally clear to your audience what you want them to do with the 
new understanding or knowledge that you've imparted to them. 
One classic way to end a story is to tie it back to the beginning. At 
the beginning of our story, we set up the plot and introduced the 
dramatic tension. To wrap up, you can think about recapping this 
problem and the resulting need for action, reiterating any sense of 
urgency and sending your audience off ready to act. 

When it comes to the order and telling of our story, another important
consideration is the narrative structure, which we'll discuss next.


The narrative structure

In order to be successful, a narrative has to be central to the
communication. These are words—written, spoken, or a combination of
the two—that tell the story in an order that makes sense and
convince the audience why it's important or interesting and worth
their attention.

The most beautiful data visualization runs the risk of falling flat
without a compelling narrative to go with it.

You've perhaps experienced this before if you've ever sat through 
a great presentation that used run-of-the-mill slides. A skilled
presenter can overcome mediocre materials. A strong narrative can
overcome less-than-ideal visuals. This is not to say that you shouldn't
spend time making your data visualizations and visual
communications great, but rather to underscore the importance of a compelling
and robust narrative. Nirvana in communicating with data is reached
when effective visuals are combined with a powerful narrative.

Let's discuss some specific considerations when it comes to both the 
order of the story and the spoken and written narrative. 


Narrative flow: the order of your story 

Think about the order in which you want your audience to
experience your story. Are they a busy audience who will appreciate if you
lead with what you want from them? Or are they a new audience, 
with whom you need to establish credibility? Do they care about 
your process or just want the answer? Is it a collaborative process 
through which you need their input? Are you asking them to make 
a decision or take an action? How can you best convince them to 
act in the way you want them to? The answers to these questions 
will help you to determine what sort of narrative flow will work best, 
given your specific situation. 

One important basic point here is that your story must have an order 
to it. A collection of numbers and words on a given topic without 
structure to organize them and give them meaning is useless. The
narrative flow is the spoken and written path along which you take
your audience over the course of your presentation or
communication. This path should be clear to you. If it isn't, there certainly isn't
a way to make it clear to your audience. 


Help me turn this into a story! 


When a client comes to me with a presentation deck
and asks for help, the first thing I have them do is set
the deck aside. I walk them through exercises that help them 
articulate the Big Idea and 3-minute story that we discussed 
in Chapter 1. Why? You have to have a solid understanding 
of what you want to communicate before you craft the
communication. Once you have the Big Idea and 3-minute story
articulated, you can start to think about what narrative flow 
makes sense and how to organize your deck. 

One way to do this is to include a slide at the beginning of 
the deck that bullets the main points in your story. This will 
become an executive summary that says to your audience 
at the onset of the presentation, "here's what we will cover 
in our time together." Then organize the remaining slides to 
follow this same flow. Finally, at the end of the presentation, 
you'll repeat this ("here's what we covered") with emphasis 
on any actions you need your audience to take, or any
decisions you need them to make. This helps to establish a
structure to your presentation and make that structure clear to
your audience. It also leverages the power of repetition to 
help your message stick with your audience. 


One way to order the story—the one that typically comes most 
naturally—is chronologically. By way of example, if we think about 
the general analytical process, it looks something like this: we
identify a problem, we gather data to better understand the situation,
we analyze the data (look at it one way, look at it another way, tie
in other things to see if they had an impact, etc.), we emerge with
a finding or solution, and based on this we have a recommended
action. One way to approach the communication of this to our
audience is to follow that same path, taking the audience through it in
the same way we experienced it. This approach can work well if you 
need to establish credibility with your audience, or if you know they 
care about the process. But chronological is not your only option. 

Another strategy is to lead with the ending. Start with the call to 
action: what you need your audience to know or do. Then back up 
into the critical pieces of the story that support it. This approach can 
work well if you've already established trust with your audience or 
you know they are more interested in the "so what" and less
interested in how you got there. Leading with the call to action has the
additional benefit of making it immediately clear to your audience 
what role they are meant to play or what lens they should have on as 
they consider the rest of your presentation or communication, and 
why they should keep listening. 

As part of making the narrative flow clear, we should consider what 
pieces of the story will be written and what will be conveyed through 
spoken words. 


The spoken and written narrative 

If you're giving a presentation—whether formally standing in front of
a room, or more informally seated around a table—a good portion 
of the narrative will be spoken. If you're sending an email or report, 
the narrative is likely entirely written. Each format presents its own 
opportunities and challenges. 

With a live presentation, you have the benefit of words on the screen 
or page being reinforced by the words you are saying. In this manner,
your audience has the opportunity to both read and hear what
they need to know, strengthening the information. You can use your
voiceover to make the "so what" of each visual clear, make it relevant
to your audience, and tie one idea to the next. You can respond to
questions and clarify as needed. One challenge with a live presentation
is that you must ensure what your audience needs to read on
a given slide or section isn't so dense or consuming that their attention
is focusing on that instead of listening to you.

Another challenge is that your audience can act unpredictably. They 
can ask questions that are off topic, jump to a point later in the
presentation, or do other things to push you off track. This is one reason
it's important—especially in a live presentation setting—to articulate
clearly the role you want your audience to play and how your presentation
is structured. For example, if you're anticipating an audience
who will want to go off track, start by saying something like, "I know
you are going to have a lot of questions. Write them down as they 
come up and I will make sure to leave time at the end to address 
any that aren't answered. But first, let's take a look at the process 
our team went through to reach our conclusion, which will lead us 
to what we are asking of you today." 

As another example, if you're planning to lead with the ending and 
this differs from the typical approach—tell your audience that this is 
what you're doing. You might say something like, "Today, I'm going 
to start with what we're asking of you. The team did some robust 
analysis that led us to this conclusion and we weighed several different
options. I will take you through all of this. But before I do, I want
to spotlight what we are asking of you today, which is ..." Telling
your audience how you are going to structure your presentation
can make both you and them more comfortable. It helps your audience
to know what to expect and what role they are meant to play.

In a written report (or a presentation deck that is sent around instead 
of presented or also used as a "leave behind" to remind people of 
the content after you've delivered the presentation), you don't have 
the benefit of the voiceover to make the sections or slides relevant—
rather, they must do this on their own. The written narrative is what
will achieve this. Think about what words need to be present. In the
case when something will be sent around without you there to explain
it, it's especially important to make the "so what" of each slide or
section clear. You've probably experienced when this has not been
done well: you're looking through a presentation and encounter a
slide of bulleted facts, or a graph or table packed with numbers, 
and are thinking, "I have no idea what I'm meant to get out of this." 
Don't let this happen to your work: make sure the words are present 
to make your point clear and relevant to your audience. 

Getting feedback from someone not as familiar with the topic can 
be especially useful in this situation. Doing so will help you uncover 
issues with clarity and flow, or questions your audience may have, so 
you can address those proactively. In terms of benefits of the written 
report approach, if you make your structure clear, your audience can 
turn directly to the parts that interest them. 

While we establish narrative structure and flow, the power of
repetition is another strategy we can leverage within our storytelling.


The power of repetition 

Thinking back to Red Riding Hood, one of the reasons I remember 
the story is due to repetition. I was told and read the story countless 
times as a little girl. As we discussed in Chapter 4, important
information is gradually transferred from short-term memory into
long-term memory. The more the information is repeated or used, the
more likely it is to eventually end up in long-term memory, or to be 
retained. That's why the story of Red Riding Hood remains in my head 
today. We can leverage this power of repetition in the stories we tell. 


Repeatable sound bites


"If people can easily recall, repeat, and transfer your
message, you did a great job conveying it." To help facilitate
this, Nancy Duarte recommends leveraging repeatable sound
bites: succinct, clear, and repeatable phrases. Check out her
book, Resonate, to learn more.


When it comes to employing the power of repetition, let's explore a
concept called Bing, Bang, Bongo. My junior high English teacher 
introduced this idea to me when we were learning to write essays. 
The concept stuck with me—perhaps due to the consonance of the
"Bing, Bang, Bongo" name and my teacher's use of it as a
repeatable sound bite—and it can be leveraged when we need to tell a
story with data. 

The idea is that you should first tell your audience what you're going 
to tell them ("Bing," the introduction paragraph in your essay). Then 
you tell it to them ("Bang," the actual essay content). Then you
summarize what you just told them ("Bongo," the conclusion). Applying
this to a presentation or report, you can start with an executive
summary that outlines for your audience what you are going to cover,
then you can provide the detail or main content of your
presentation, and finally end with a summary slide or section that reviews the
main points you covered (Figure 7.1).

FIGURE 7.1 Bing, bang, bongo


If you're the one preparing or giving the presentation or writing 
the report, this may feel redundant, since you're already familiar 
with the content. But to your audience—who is not as close to the 
content—it feels nice. You've set their expectations on what you're
going to cover, then provided detail, and then recapped. The
repetition helps cement it in their memory. After hearing your message
three times, they should be clear on what they are meant to know 
and do from the story you've just told. 

Bing, Bang, Bongo is one strategy to leverage to help ensure that 
your story is clear. Let's consider some additional tactics. 


Tactics to help ensure that your story is clear 

There are a number of concepts I routinely discuss in my workshops 
for helping to ensure that the story you're telling in your
communication comes across. These apply mainly to a presentation deck.
While not always the case, I find that this is often the primary form of
communicating analytical results, findings, and recommendations at
many companies. Some of the concepts we'll discuss will be
applicable to written reports and other formats as well.

Let's discuss four tactics to help ensure that your story is clear in your 
presentation: horizontal logic, vertical logic, reverse storyboarding, 
and a fresh perspective. 


Horizontal logic 

The idea behind horizontal logic is that you can read just the slide 
title of each slide throughout your deck and, together, these
snippets tell the overarching story you want to communicate. It is
important to have action titles (not descriptive titles) for this to work well.

One strategy is to have an executive summary slide up front, with 
each bullet corresponding to a subsequent slide title in the same 
order (Figure 7.2). This is a nice way of setting it up so your audience 
knows what to expect and then is taken through the detail (think 
back to the Bing, Bang, Bongo approach we covered previously).

FIGURE 7.2 Horizontal logic


Checking for horizontal logic is one approach to test whether the 
story you want to tell is coming through clearly in your deck. 
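
If you happen to keep your deck's titles in a plain list, this check can be roughed out in a few lines of Python. The following is only a minimal sketch under stated assumptions: the function name check_horizontal_logic and the sample titles are hypothetical, and matching on lightly normalized text is a simplification of what a human reader would judge.

```python
from typing import List, Tuple


def _normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")


def check_horizontal_logic(
    summary_bullets: List[str], slide_titles: List[str]
) -> List[Tuple[int, str, str]]:
    """Compare executive summary bullets with slide titles, position by position.

    Returns (position, bullet, title) for every position where the two differ.
    Positions are 1-based; a missing bullet or title is reported as "".
    """
    mismatches = []
    for i in range(max(len(summary_bullets), len(slide_titles))):
        bullet = summary_bullets[i] if i < len(summary_bullets) else ""
        title = slide_titles[i] if i < len(slide_titles) else ""
        if _normalize(bullet) != _normalize(title):
            mismatches.append((i + 1, bullet, title))
    return mismatches


# Hypothetical deck: the third slide title does not echo the third bullet.
bullets = [
    "Sales fell 10% in Q3",
    "Churn drove the decline",
    "Invest in retention",
]
titles = [
    "Sales fell 10% in Q3",
    "Churn drove the decline",
    "Appendix",
]
for position, bullet, title in check_horizontal_logic(bullets, titles):
    print(f"Slide {position}: summary says {bullet!r}, title says {title!r}")
```

An empty result only means the titles line up with the summary; whether they are true action titles still takes a human read.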


Vertical logic 

Vertical logic means that all information on a given slide is
self-reinforcing. The content reinforces the title and vice versa. The words
reinforce the visual and vice versa (Figure 7.3). There isn't any
extraneous or unrelated information. Much of the time, the decision on
what to eliminate or push to an appendix is as important (sometimes
more so) as the decision on what to retain.



FIGURE 7.3 Vertical logic

Employing horizontal and vertical logic together will help ensure that
the story you want to tell comes across clearly in your communication. 


Reverse storyboarding 

When you storyboard at the onset of building a communication, you
craft the outline of the story you intend to tell. As the name implies,
reverse storyboarding does the opposite. You take the final
communication, flip through it, and write down the main point from each
page (it's a nice way to test your horizontal logic as well). The
resulting list should look like the storyboard or outline for the story you
want to tell (Figure 7.4). If it doesn't, this can help you understand 
structurally where you might want to add, remove, or move pieces 
around to create the overall flow and structure for the story that 
you're interested in conveying. 




FIGURE 7.4 Reverse storyboarding 
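The comparison between the list you wrote down and the storyboard you planned can also be sketched in code. This is a rough illustration, not a prescribed method: it assumes both are short lists of main points written in the same words, and it uses Python's standard difflib module to flag where the two sequences part ways. The function name and the example points are hypothetical.

```python
from difflib import SequenceMatcher


def compare_to_storyboard(planned, actual):
    """Describe where the finished deck departs from the planned storyboard."""
    notes = []
    matcher = SequenceMatcher(a=planned, b=actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Points in the plan with no counterpart at this spot in the deck.
        if tag in ("delete", "replace"):
            for point in planned[i1:i2]:
                notes.append(f"planned but missing here: {point}")
        # Points in the deck that the plan did not have at this spot.
        if tag in ("insert", "replace"):
            for point in actual[j1:j2]:
                notes.append(f"in the deck but not planned here: {point}")
    return notes


# Hypothetical storyboard and reverse storyboard.
planned = ["set the context", "introduce tension", "recommend a price"]
actual = ["set the context", "recommend a price"]
for note in compare_to_storyboard(planned, actual):
    print(note)
```

A point that shows up once as "missing" and once as "not planned" has most likely been moved rather than dropped, which is the cue to reconsider the order.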


A fresh perspective 

We've discussed the value of a fresh perspective to help see
through your audience's lens when it comes to your data visualization
(Figure 7.5). Seeking this sort of input for your overall
presentation can be immensely helpful as well. Once you've crafted
your communication, give it to a friend or colleague. It can be someone
without any context (it's actually helpful if it is someone without
any context, because this puts them in a much closer position to 
your audience than you can be, given your intimate knowledge of 
the subject matter). Ask them to tell you what they pay attention to, 
what they think is important, and where they have questions. This will 
help you understand whether the communication you've crafted is 
telling the story you mean to tell or, in the case where it isn't exactly, 
help you identify where to concentrate your iterations. 



FIGURE 7.5 A fresh perspective


There is incredible value in getting a fresh perspective when it
comes to communicating with data in general. As we become subject
matter experts in our space, it becomes impossible for us to
take a step back and look at what we've created (whether a single
graph or a full presentation) through our audience's eyes. But that
doesn't mean you can't see what they see. Leverage a friend or
colleague for their fresh perspective. Help ensure that your
communication hits the mark.


In closing 

Stories are magical. They have the power of captivating us and
sticking with us in ways that facts alone cannot. They lend structure.
Why wouldn't you leverage this potential when crafting your
communications?


When we construct stories, we should do so with a beginning (plot),
middle (twists), and end (call to action). Conflict and tension are key
to grabbing and maintaining your audience's attention. Another central
component to story is the narrative, which we should consider
in terms of both order (chronological or lead with ending) and manner
(spoken, written, or a combination of the two). We can utilize the
power of repetition to help our stories stick with our audience. Tactics
such as horizontal and vertical logic, reverse storyboarding, and
seeking a fresh perspective can be employed to help ensure that our 
stories come across clearly in our communications. 

The main character in every story we tell should be the same: our 
audience. It is by making our audience the protagonist that we can 
ensure the story is about them, not about us. By making the data we 
want to show relevant to our audience, it becomes a pivotal point 
in our story. No longer will you just show data. Rather, you will tell a 
story with data. 

With that, you can consider your final lesson learned. You now know
how to tell a story.

Next, let's look at an example of the entire storytelling with data
process, from start to finish.



chapter eight 


pulling it all together 


Up to this point, we've focused on individual lessons that, together, 
set you up for success when it comes to effectively visualizing and 
communicating with data. To refresh your memory, we've covered 
the following lessons: 

1. Understand the context (Chapter 1) 

2. Choose an appropriate display (Chapter 2) 

3. Eliminate clutter (Chapter 3) 

4. Draw attention where you want it (Chapter 4) 

5. Think like a designer (Chapter 5) 

6. Tell a story (Chapter 7) 


In this chapter, we will look at the comprehensive storytelling with 
data process from start to end—applying each of the preceding 
lessons—using a single example.

Let's begin by considering Figure 8.1, which shows average retail
price over time for five consumer products (A, B, C, D, and E). Spend 
a moment studying it. 


Price has declined for all products on the market 
since the launch of Product C in 2010 

Average Retail Product Price Per Year

FIGURE 8.1 Original visual



When presented with this graph, it's easy to start picking it apart. 
But before we discuss the best way to visualize the data shown in 
Figure 8.1, let's take a step back and consider the context. 


Lesson 1: understand the context 

The first thing to do when faced with a visualization challenge is to 
make sure you have a robust understanding of the context and what 
you need to communicate. We must identify a specific audience and 
what they need to know or do, and determine the data we'll use to 
illustrate our case. We should craft the Big Idea.

In this case, let's assume we work for a startup that has created a
consumer product. We are starting to think about how to price the 
product. One of the considerations in this decision-making process— 
the one we will focus on here—is how competitors' retail prices for 
products in this marketplace have changed over time. There is an 
observation made with the original visual that may be important: 
"Price has declined for all products on the market since the launch 
of Product C in 2010." 

If we pause to consider specifically the who, what, and how, let's
assume the following:

Who: VP of Product, the primary decision maker in establishing 
our product's price. 

What: Understand how competitors' pricing has changed over 
time and recommend a price range. 

How: Show average retail price over time for Products A, B, C, 
D, and E. 

The Big Idea, then, could be something like: Based on analysis of
pricing in the market over time, to be competitive, we recommend
introducing our product at a retail price in the range $ABC-$XYZ.

Next, let's consider some different ways to visualize this data. 


Lesson 2: choose an appropriate display 

Once we've identified the data we want to show, next comes the 
challenge of determining how to best visualize it. In this case, we 
are most interested in the trend in price over time for each product. 
If we look back to Figure 8.1, the variance in colors across the bars
distracts from this, making the exercise more difficult than necessary.
Bear with me, as we're going to go through more iterations of looking
at this data than you might typically. The progression is interesting
because it illustrates how different views of the data can influence
what you pay attention to and the observations you can easily make.

First, let's remove the visual obstacle of the variance in color and
view the resulting graph, shown in Figure 8.2.


Average Retail Product Price Per Year

FIGURE 8.2 Remove the variance in color

If you're tempted to continue decluttering at this point, you are not 
alone. I had to resist the urge since that's something I typically do 
as I go along. In this case, let's refrain from doing so until the next 
section, where we can address it all at once. 

Since the emphasis in the original headline was on what happened 
since Product C was launched in 2010, let's highlight the relevant 
pieces of data to make it easier to focus our attention there for a 
moment. See Figure 8.3.

Average Retail Product Price per Year

FIGURE 8.3 Emphasize 2010 forward

Upon studying this, we see clear declines in the average retail price 
for Products A and B in the time period of interest, but this doesn't 
appear to hold true for the products that were launched later. We 
will definitely need to change the headline from the original visual 
to reflect this when we tell our comprehensive story. 

If you've been thinking we should try a line graph here instead of a bar
chart—since we are primarily interested in the trend over time—you
are absolutely right. In doing so, we also eliminate the stairstep view
that bars create somewhat artificially. Let's see what lines would look
like with the same layout as above. This is illustrated in Figure 8.4.

Average Retail Product Price per Year

FIGURE 8.4 Change to line graph


The view in Figure 8.4 allows us to see what's happening over time 
more clearly for one product at a time. But it is hard to compare the 
products at a given point in time to one another. Graphing all of 
the lines against the same x-axis will solve this. This will also reduce 
the clutter and redundancy of the multiple year labels. The resulting 
graph might look like Figure 8.5.

Average Retail Product Price per Year

FIGURE 8.5 Single line graph for all products


With the transition to the new graph setup, Excel added back the
color that we removed in an earlier step (tying the data to the accompanying
legend at the bottom). Let's ignore that for a moment while
we consider whether this view of the data will meet our needs. If we
revisit our purpose, it is to understand how competitors' prices have
changed over time. The way the data is shown in Figure 8.5 allows for
this with relative ease. We can make taking in this information even
easier by eliminating clutter and drawing attention where we want it.


Lesson 3: eliminate clutter 

Figure 8.5 shows what our visual looks like when we rely on the default 
settings of our graphing application (Excel). We can improve this 
with the following changes: 

• De-emphasize the chart title. It needs to be present, but doesn't 
need to attract as much attention as it does when written in bold 
black.

• Remove chart border and gridlines, which take up space without
adding much value. Don't let unnecessary elements distract
from your data!

• Push the x- and y-axis lines and labels to the background by
making them grey. They shouldn't compete visually with the data.
Modify the x-axis tick marks so they align with the data points.

• Remove the variance in colors between the various lines.
We can use color more strategically, which we'll discuss further
momentarily.

• Label the lines directly, eliminating the work of going back and 
forth between the legend and the data to understand what is 
being shown. 

Figure 8.6 shows what the graph looks like after making these
changes.

FIGURE 8.6 Eliminate clutter


Next, let's explore how we can focus our audience's attention.


Lesson 4: draw attention where you want your
audience to focus

With the view shown in Figure 8.6, we can much more easily see and
comment on what's happening over time. Let's explore how we can
focus on different aspects of the data through strategic use of
preattentive attributes.

Consider the initial headline: "Price has declined for all products on
the market since the launch of Product C in 2010." Upon a closer look
at the data, I might modify it to say something like, "After the launch
of Product C in 2010, the average retail price of existing products
declined." Figure 8.7 demonstrates how we can tie the important
points in the data to these words through the strategic use of color.


Average Retail Product Price per Year 



In addition to the colored segments of the lines in Figure 8.7, attention
is also drawn to the introduction of Product C in 2010 through
the addition of a data marker at that point. This is tied visually to
the subsequent decrease over time in Products A and B through the
consistent use of color.

Typically, you format a series of data (a line or a series of
bars) all at once. Sometimes, however, it can be useful
to have certain points formatted differently—for example, to
draw attention to specific parts, as illustrated in Figures 8.7,
8.8, and 8.9. To do this, click on the data series once to highlight
it, then click again to highlight just the point of interest.
Right-click and select Format Data Point to open the menu 
that will allow you to reformat the specific point as desired 
(for example, to change the color or add a data marker). 
Repeat this process for each data point you want to modify. 

It takes time, but the resulting visual is easier to comprehend 
for your audience. It is time well spent! 


We can use this same view and strategy to concentrate on another 
observation—one perhaps more interesting and noteworthy: "With 
the launch of a new product in this space, it is typical to see an initial 
average retail price increase, followed by a decline." See Figure 8.8.

FIGURE 8.8 Refocus the audience's attention

It might also be interesting to note, "As of 2014, retail prices have
converged across products, with an average retail price of $223, 
ranging from a low of $180 (Product C) to a high of $260 (Product 
A)." Figure 8.9 uses color and data markers to draw our attention to 
the specific points in the data that support this observation. 


Average Retail Product Price per Year 



With each different view of the data, the use of preattentive attributes
allows you to more clearly see certain things. This strategy
can be used to highlight and tell different pieces of a nuanced story. 
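If you build graphs in code rather than by clicking through a formatting menu, the same idea can be expressed as a color plan: every line segment starts out grey, and only the segments that support the point you are making get the emphasis color. The sketch below is an assumption-laden illustration, not the process used for the figures here. The function names and the specific colors are hypothetical, and Products D and E are left out because their launch years are not stated in the text.

```python
EMPHASIS = "blue"    # assumed highlight color
BACKGROUND = "grey"  # everything else is pushed to the background


def segment_colors(years, emphasize_from, emphasized):
    """Return one color per line segment (one segment per pair of adjacent years)."""
    return [
        EMPHASIS if emphasized and start >= emphasize_from else BACKGROUND
        for start, _end in zip(years, years[1:])
    ]


def color_plan(product_years, emphasized_products, emphasize_from):
    """Map each product to the list of colors for its segments."""
    return {
        product: segment_colors(years, emphasize_from, product in emphasized_products)
        for product, years in product_years.items()
    }


# Products A and B launched in 2008 and Product C in 2010; data runs to 2014.
product_years = {
    "A": list(range(2008, 2015)),
    "B": list(range(2008, 2015)),
    "C": list(range(2010, 2015)),
}
plan = color_plan(product_years, emphasized_products={"A", "B"}, emphasize_from=2010)
for product, colors in plan.items():
    print(product, colors)
```

Changing only the set of emphasized products and the starting year yields a different plan for each observation, while the underlying graph stays the same.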

But before we continue thinking through how to best tell the story, 
let's put on our designer hats and perfect the visual. 


Lesson 5: think like a designer 

Though you may not have recognized it explicitly as such, we've
already been thinking like a designer through this process. Form
follows function: we chose a visual display (form) that will allow our
audience to do what we need them to do (function) with ease. When
it comes to using visual affordances to make it clear how our audience
should interact with our visual, we've already taken steps to
cut clutter and de-emphasize some elements of the graph, while
emphasizing and drawing attention to others.

We can further improve this visual by leveraging the lessons we covered
in Chapter 5 with respect to accessibility and aesthetics.
Specifically, we can:

• Make the visual accessible with text. We can use simpler text in 
the graph title and capitalize only the first word to make it easier 
to comprehend and quicker to read. We also need to add axis 
titles to both the vertical and horizontal axes. 

• Align elements to improve aesthetics. The center alignment of
the graph title leaves it hanging in space and doesn't align it with
any other elements; we should upper-left-most align the graph
title. Align the y-axis title vertically with the uppermost label and
the x-axis title horizontally with the leftmost label. This creates
cleaner lines and ensures that your audience sees how to interpret
what they are looking at before they get to the actual data.

Figure 8.10 shows what the visual looks like after these changes have
been made.

FIGURE 8.10 Add text and align elements


Lesson 6: tell a story

Finally, it is time to think about how we can use the visual we've created
in Figure 8.10 as a foundation to walk our audience through the
story in the way that we want them to experience it.

Imagine we have five minutes in a live presentation setting under
the agenda topic: "Competitive Landscape—Pricing." The following
sequence (Figures 8.11-8.19) illustrates one path we could take
for telling a story with this data.


In the next 5 minutes... 

OUR GOAL: 

1 Understand how prices have changed 
over time in the competitive landscape. 

2 Use this knowledge to inform the pricing 
of our product 

We will end with a specific recommendation. 


FIGURE 8.11

Products A and B were launched in 2008 at price points of $360+

Retail price over time

FIGURE 8.12

They have been priced similarly over time, with B consistently
slightly lower than A

Retail price over time

FIGURE 8.13

In 2014, Products A and B were priced at $260 and $250,
respectively

Retail price over time

FIGURE 8.14

Products C, D, and E were each introduced later
at much lower price points...

Retail price over time

FIGURE 8.15

...but all have increased in price since their respective launches

Retail price over time

FIGURE 8.16

In fact, with the launch of a new product in this space, we tend to
see an initial price increase, followed by a decrease over time

Retail price over time

FIGURE 8.17

As of 2014, retail prices have converged, with an average retail
price of $223, ranging from a low of $180 (C) to a high of $260 (A)

Retail price over time

FIGURE 8.18

To be competitive, we recommend introducing our product below
the $223 average price point in the $150-$200 range

Retail price over time

FIGURE 8.19









Let's consider this progression. We started off by telling our audience
the structure we would follow. I can imagine the voiceover in
the live presentation could further set the plot before moving to the 
next slide: "As you all know, there are five products that will be our 
key competition in the marketplace," then building the chronological 
price path that those products followed. We can introduce tension 
in the competitive landscape when Products C, D, and E significantly 
undercut existing price points at their respective launches. We can 
then restore a sense of balance as the prices converge. We end with 
a clear call to action: the recommendation for pricing our product. 

By drawing our audience's attention to the specific part of the story 
we want to focus on—either by only showing the relevant points or 
by pushing other things to the background and emphasizing only the 
relevant pieces and pairing this with a thoughtful narrative—we've 
led our audience through the story. 

Here, we've looked at an example telling a story with a single visual. 
This same process and individual lessons can be followed when you 
have multiple visuals in a broader presentation or communication. In 
that case, think about the overarching story that ties it all together. 
Individual stories for a given visualization within that larger presentation,
such as the one we've looked at here, can be considered subplots
within the broader storyline.


In closing 

Through this example, we've seen the storytelling with data process
from start to finish. We began by building a robust understanding
of the context. We chose an appropriate visual display. We identified
and eliminated clutter. We used preattentive attributes to draw
our audience's attention to where we want them to focus. We put
on our designer hats, adding text to make our visual accessible and
employing alignment to improve the aesthetics. We crafted a compelling
narrative and told a story.

Consider the before-and-after shown in Figure 8.20. 


Price has declined for all products on the market
since the launch of Product C in 2010



To be competitive, we recommend introducing our product below
the $223 average price point in the $150-$200 range


Retail price over time 



FIGURE 8.20 Before-and-after 


The lessons we've learned and employed help us move from simply 
showing data to storytelling with data. 

















chapter nine 


case studies 


At this point, you should feel like you have a solid foundation for
communicating effectively with data. In this penultimate chapter, we
explore strategies for tackling common challenges faced when communicating
with data through a number of case studies.

Specifically, we'll discuss: 

• Color considerations with a dark background 

• Leveraging animation in the visuals you present 

• Establishing logic in order 

• Strategies for avoiding the spaghetti graph 

• Alternatives to pie charts 


Within each of these case studies, I'll apply the various lessons we've
covered when it comes to communicating effectively with data, but
will limit my discussion mainly to the specific challenge at hand.


CASE STUDY 1: Color considerations with a dark
background


When it comes to communicating data, I don't typically recommend 
anything other than a white background. Let's take a look at what 
a simple graph looks like on a white, blue, and black background. 
See Figure 9.1. 


[Three panels: White background, Blue background, Black background.
Each shows the same simple graph, with a vertical axis from 0 to 5 and
a horizontal axis from Jan to May.]

FIGURE 9.1 Simple graph on white, blue, and black background 


If you had to describe in a single word how the blue and black backgrounds
in Figure 9.1 make you feel, what would that word be? For
me, it would be heavy. With the white background, I find it easy to 
focus on the data. The dark backgrounds, on the other hand, pull 
my eyes there—to the background—and away from the data. Light 
elements on a dark background can create a stronger contrast but 
are generally harder to read. Because of this, I typically avoid dark 
and colored backgrounds. 

That said, sometimes there are considerations outside of the ideal
scenario for communicating with data that must be taken into account,
such as your company or client's brand and corresponding standard
template. This was the challenge I faced in one consulting project.

I didn't recognize this immediately. It was only after I had completed 
my initial revamp of the client's original visual that I realized it just 
didn't quite fit with the look and feel of the work products I'd seen 
from the client group. Their template was bold and in your face with
a mottled, black background spiked with bright, heavily saturated
colors. In comparison, my visual felt rather meek. Figure 9.2 shows
a generalized version of my initial makeover of a visual displaying
employee survey feedback.

Survey Results: Team X

Legend: Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree
Horizontal axis: Percent of Total (0% to 100%)
Rows: Survey item A, Survey item B, Survey item C, Survey item D
Data labels visible: 1%, 33%, 11%, 9%
Annotations: "Survey item A ranked highest for team X" and
"Dissatisfaction was greatest for Survey item D"

FIGURE 9.2 Initial makeover on white background 


In an endeavor to create something more in sync with the client's
brand, I remade my own makeover, leveraging the same dark background
I'd seen used in some of the other examples shared. In
doing so, I had to reverse my normal thought process. With a white 
background, the further a color is from white, the more it will stand 
out (so grey stands out less, whereas black stands out very much). 
With a black background, the same is true, but black becomes the 
baseline (so grey stands out less, and white stands out very much). 
I also realized some colors that are typically verboten with a white
background (for example, yellow) are effective against a black
background (I didn't use yellow in this particular example but did
in some others).
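One way to put rough numbers on "how far a color is from the background" is the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG). To be clear, this is a stand-in I am bringing in for illustration, not the reasoning used in the makeover itself, and the RGB values below are assumed examples rather than the client's palette.

```python
def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color given as 0-255 integers."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color, background):
    """WCAG contrast ratio: 1 means no contrast, 21 is the maximum."""
    lum_a = relative_luminance(color)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
GREY, YELLOW = (128, 128, 128), (255, 255, 0)  # assumed example colors

for name, color in [("grey", GREY), ("yellow", YELLOW)]:
    on_white = contrast_ratio(color, WHITE)
    on_black = contrast_ratio(color, BLACK)
    print(f"{name}: {on_white:.1f} on white, {on_black:.1f} on black")
```

Under this measure, yellow has almost no contrast with white and a great deal with black, which is consistent with it being off limits on one background and usable on the other. Contrast alone does not capture how heavy a dark background feels, so treat the number as a check on legibility, not as a design decision.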


Figure 9.3 depicts how my "more in line with the client's brand" version
of the visual looked.

FIGURE 9.3 Remake on dark background


While the content is exactly the same, note how different Figure 9.3 
feels compared to Figure 9.2. This is a good illustration of how color 
can impact the overall tone of a visualization. 


CASE STUDY 2: Leveraging animation in the 
visuals you present 

One conundrum commonly faced when communicating with data 
is when a single view of the data is used for both presentation and 
report. When presenting content in a live setting, you want to be 
able to walk your audience through the story, focusing on just the relevant
part of the visual. However, the version that gets circulated to
your audience—as a pre-read or takeaway, or for those who weren't 
able to attend the meeting—needs to be able to stand on its own 
without you, the presenter, there to walk the audience through it. 

Too often, we use the exact same content and visuals for both purposes.
This typically renders the content too detailed for the live presentation
(particularly if it is being projected on the big screen) and
sometimes not detailed enough for the circulated content. This gives
rise to the slideument—part presentation, part document, and not
exactly meeting the needs of either—which we touched upon briefly
in Chapter 1. In the following, we'll look at a strategy for leveraging
animation coupled with an annotated line graph to meet both the
presentation and circulation needs.

Let's assume that you work for a company that makes online social 
games. You are interested in telling the story around how active users 
for a given game—let's call it Moonville—have grown over time. 

You could use Figure 9.4 to talk about growth since the launch of 
the game in late 2013. 


Moonville: active users over time 

100,000 



Data source: ABC Report. For purpose of analysis "active user" is defined as the number of unique users in the past 30 days. 

FIGURE 9.4 Original graph 

The challenge, however, is that when you put this much data in front
of your audience, you lose control over their attention. You might
be talking about one part of the data while they are focusing somewhere
else entirely. Perhaps you want to tell the story chronologically,
but your audience may jump immediately to the sharp increase
in 2015 and wonder what drove that. When they do so, they stop
listening to you.

Alternatively, you can leverage animation to walk your audience
through your visual as you tell the corresponding points of the story.
For example, I could start with a blank graph. This forces the audience
to look at the graph details with you, rather than jump straight
to the data and start trying to interpret it. You can use this approach
to build anticipation within your audience that will help you to retain
their attention. From there, I subsequently show or highlight only the
data that is relevant to the specific point I am making, forcing the
audience's attention to be exactly where I want it as I am speaking.

I might say—and show—the following progression: 

Today, I'm going to talk you through a success story: the increase
in Moonville users over time. First, let me set up what we are looking
at. On the vertical y-axis of this graph, we're going to plot active
users. This is defined as the number of unique users in the past 30
days. We'll look at how this has changed over time, from the launch
in late 2013 to today, shown along the horizontal x-axis. (Figure 9.5)


FIGURE 9.5

We launched Moonville in September 2013. By the end of that first
month, we had just over 5,000 active users, denoted by the big blue
dot at the bottom left of the graph. (Figure 9.6)


FIGURE 9.6

Early feedback on the game was mixed. In spite of this—and our
practically complete lack of marketing—the number of active users
nearly doubled in the first four months, to almost 11,000 active users
by the end of December. (Figure 9.7)


FIGURE 9.7

In early 2014, the number of active users increased along a steeper
trajectory. This was primarily the result of the friends and family
promotions we ran during this time to increase awareness of the game.
(Figure 9.8)


FIGURE 9.8

Growth was pretty flat over the rest of 2014 as we halted all marketing
efforts and focused on quality improvements to the game. (Figure 9.9)


FIGURE 9.9

Uptake this year, on the other hand, has been incredible, surpassing
our expectations. The revamped and improved game has gone
viral. The partnerships we've forged with social media channels have
proven successful for continuing to increase our active user base. At
recent growth rates, we anticipate we'll surpass 100,000 active users
in June! (Figure 9.10)


FIGURE 9.10

For the more detailed version that you circulate as a follow up or for
those who missed your (stellar) presentation, you can leverage a version
that annotates the salient points of the story on the line graph
directly, as shown in Figure 9.11.

FIGURE 9.11

This is one strategy for creating a visual (or, in this case, set of visuals)
that meets both the needs of your live presentation and the circulated
version. Note that with this approach, it is imperative that you
know your story well to be able to narrate without relying on your
visuals (something you should always aim for regardless).

If you're leveraging presentation software, you can set up all of the 
above on a single slide and use animation for the live presentation, 
having each image appear and disappear as needed to form the 
desired progression. Put the final annotated version on top so it's 
all that shows on the printed version of the slide. If you do this, you 
can use the exact same deck for the presentation and the communication
that you circulate. Alternatively, you can put each graph on
a separate slide and flip through them; in this case, you'd only want 
to circulate the final annotated version. 


Figure 9.11 annotations (left to right):

Sep-Dec 2013: Moonville launched with 5K active users in Sep. Early feedback was mixed; still, the number of active users nearly doubled in the first four months.

Jan-Mar 2014: The number of active users increased with steeper trajectory as a result of friends and family promotions.

Mar-Dec 2014: Growth was marginal through the rest of 2014 as we halted marketing efforts to focus on quality improvements.

YTD 2015: The revamped game plus partnerships with social media channels have been very successful. Given recent growth rate, we anticipate we will surpass 100K active users in June.

Data labels: 10,931; 28,746; 39,214; 94,255

X-axis: months from Sep 2013 through May 2015

CASE STUDY 3: Logic in order 

There should be logic in the order in which you display information. 

The above statement probably goes without saying. Yet, like so 
many things that seem logical when we read them or hear them or 
say them out loud, too often we don't put them into practice. This 
is one such example. 

While I would say my introductory sentence is universally true, I'll 
focus here on a very specific example to illustrate the concept: leveraging
order for categorical data in a horizontal bar chart.

First, let's set the context. Let's say you work at a company that sells a 
product that has various features. You've recently surveyed your users 
to understand whether they are using each of the features and how 
satisfied they've been with them and want to put that data to use. 
The initial graph you create might look something like Figure 9.12. 


How satisfied have you been with each of these features?

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O



FIGURE 9.12 User satisfaction, original graph 
This is a real example, and Figure 9.12 shows the actual graph that
was created for this purpose, with the exception that I've replaced 
the descriptive feature names with Feature A, Feature B, and so on. 
There is an order here—if we stare at the data for a bit, we find that 
it is arranged in decreasing order of the "Very satisfied" group plus 
the "Completely satisfied" group (the teal and dark teal segments 
on the right side of the graph). This may suggest that is where we 
should pay attention. But from a color standpoint, my eyes are drawn 
first to the bold black "Have not used" segment. And if we pause
to think about what the data shows, it would perhaps be the areas 
of dissatisfaction that would be of most interest. 

Part of the challenge here is that the story—the "so what"—of this 
visual is missing. We could tell a number of different stories and focus 
on a number of different aspects of this data. Let's look at a couple 
of ways to do this, with an eye towards leveraging order. 

First, we could think about highlighting the positive story: where our 
users are most satisfied. See Figure 9.13. 


Features A & B top user satisfaction

Product X User Satisfaction: Features

Legend: Completely satisfied, Very satisfied, Somewhat satisfied, Not very satisfied, Not satisfied at all, Have not used

Bars (top to bottom): Feature A through Feature O



Responses based on survey question "How satisfied have you been with each of these features?". 

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 

FIGURE 9.13 Highlight the positive story 
In Figure 9.13, I've ordered the data clearly by putting "Completely
satisfied" plus "Very satisfied" in descending order—the same as in 
the original graph—but I've made it much more obvious here through 
other visual cues (namely, color, but also the positioning of the segments
as the first series in the graph, so the audience's attention hits
it first as they scan from left to right). I've also used words to help 
explain why your attention is drawn to where it is via the action title 
at the top, which calls out what you should be seeing in the visual. 
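
The ordering step described above can be sketched in a few lines of Python. This is only an illustration: the feature names and percentages are made up (the survey's actual numbers are not given here), and `order_by` is a hypothetical helper, not part of any plotting library.

```python
# Hypothetical survey results: percent of respondents per answer, per feature.
# These numbers are placeholders, not the data behind the figures.
responses = {
    "Feature 1": {"Completely satisfied": 20, "Very satisfied": 30,
                  "Somewhat satisfied": 25, "Not very satisfied": 10,
                  "Not satisfied at all": 5, "Have not used": 10},
    "Feature 2": {"Completely satisfied": 35, "Very satisfied": 30,
                  "Somewhat satisfied": 15, "Not very satisfied": 5,
                  "Not satisfied at all": 5, "Have not used": 10},
    "Feature 3": {"Completely satisfied": 10, "Very satisfied": 15,
                  "Somewhat satisfied": 20, "Not very satisfied": 20,
                  "Not satisfied at all": 15, "Have not used": 20},
}


def order_by(responses, segments):
    """Return feature names sorted by the combined share of `segments`, largest first."""
    def score(feature):
        return sum(responses[feature][segment] for segment in segments)
    return sorted(responses, key=score, reverse=True)


# Descending order of "Completely satisfied" plus "Very satisfied".
positive_order = order_by(responses, ["Completely satisfied", "Very satisfied"])
print(positive_order)
```

Passing a different list of segments to the same helper gives the order for a different emphasis, so the sort key and the story being told stay in step.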

We can leverage these same tactics—order, color, placement, and 
words—to highlight a different story within this data: where users 
are least satisfied. See Figure 9.14. 


Users least satisfied with Features N & J

Product X User Satisfaction: Features

Legend: Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied, Have not used

Bars (top to bottom): Features N, J, M, C, G, I, E, H, O, F, D, K, L, B, A

Responses based on survey question "How satisfied have you been with each of these features?". 

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 

Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 

FIGURE 9.14 Highlight dissatisfaction 

Or perhaps the real story here is in the unused features, which could 
be highlighted as shown in Figure 9.15. 
Feature O is least used

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Features O, M, K, L, N, I, G, F, H, D, E, C, J, A, B






Responses based on survey question "How satisfied have you been with each of these features?".

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 

FIGURE 9.15 Focus on unused features 


Note that in Figure 9.15, you can still get to the differing levels of satisfaction
(or lack thereof) within each bar, but they've been pushed back
to a second-order comparison due to the color choices I've made, 
while the relative rank ordering of the "Have not used" segment is 
the clear primary comparison on which my audience is meant to focus. 

If we want to tell one of the above stories, we can leverage order, 
color, position, and words as I've shown to draw our audience's attention
to where we want them to pay it in the data. If we want to tell all
three stories, however, I'd recommend a slightly different approach. 

It isn't very nice to get your audience familiar with the data only to 
completely rearrange it. Doing so creates a mental tax—the same 
sort of unnecessary cognitive burden that we discussed in Chapter 3 
that we want to avoid. Let's create a base visual and preserve the 
same order so our audience only has to familiarize themselves with 
the detail once—highlighting the different stories one at a time 
through strategic use of color. 
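
One way to think about this is that the category order and segment order are fixed once, and only a color mapping changes from view to view. The sketch below assumes hypothetical hex colors and a hypothetical `segment_colors` helper; the resulting dictionaries could be handed to whatever charting tool you use.

```python
# Segment order is fixed (left to right) and never changes between views.
SEGMENTS = [
    "Have not used", "Not satisfied at all", "Not very satisfied",
    "Somewhat satisfied", "Very satisfied", "Completely satisfied",
]

MUTED = "#bfbfbf"  # assumed neutral grey for de-emphasized segments


def segment_colors(highlight, accent):
    """Map every segment to a color: `accent` if highlighted, grey otherwise."""
    return {seg: (accent if seg in highlight else MUTED) for seg in SEGMENTS}


# One color mapping per story; the accent colors are illustrative choices.
views = {
    "base": segment_colors(set(), MUTED),
    "satisfaction": segment_colors({"Very satisfied", "Completely satisfied"}, "#1f4e79"),
    "dissatisfaction": segment_colors({"Not satisfied at all", "Not very satisfied"}, "#e07b00"),
    "unused": segment_colors({"Have not used"}, "#008080"),
}

for name, colors in views.items():
    print(name, [colors[seg] for seg in SEGMENTS])
```

Because every view is built from the same `SEGMENTS` list, the audience sees the same layout each time and only the emphasis moves.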
User satisfaction varies greatly by feature

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O

Responses based on survey question "How satisfied have you been with each of these features?".

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 

Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 

FIGURE 9.16 Set up the graph 

Figure 9.16 depicts our base visual, without anything highlighted. 
If I were presenting this to an audience, I'd use this version to 
walk them through what they are looking at: survey responses 
to the question, "How satisfied have you been with each of these 
features?"—ranging from the positive "Completely satisfied" at the 
right to "Not satisfied at all" and, finally, "Have not used" at the far 
left (leveraging the natural association of positive at the right and 
negative at the left). Then I'd pause to tell each of the stories in 
succession. 

First comes a visual similar to what we started with in the last series 
that highlights where users are the most satisfied. In this version, I've 
leveraged different shades of blue to draw attention not only to the 
proportion of users who are satisfied but specifically to Features A 
and B within those segments that rank highest, tying these bars visually
to the text that illustrates my point. See Figure 9.17.
User satisfaction varies greatly by feature

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O

Responses based on survey question "How satisfied have you been with each of these features?".

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 


FIGURE 9.17 Satisfaction 


This is followed by a focus on the other end of the spectrum to where 
users are least satisfied, again calling out and highlighting specific 
points of interest. See Figure 9.18. 


User satisfaction varies greatly by feature

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O

Callout: Users are least satisfied with Features J and N; what improvements can we make here for a better user experience?


Responses based on survey question "How satisfied have you been with each of these features?". 

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 


FIGURE 9.18 Dissatisfaction 
Note how it isn't as easy to see the relative rank ordering of the
features highlighted in Figure 9.18 as it was when they were put in
descending order (Figure 9.14) because they aren't aligned along
a common baseline to either the left or the right. We can still relatively
quickly see the primary areas of dissatisfaction (Features J
and N) since they are so much bigger than the other categories and
because of the color emphasis. I've added a callout box to highlight
this through text as well.

Finally, preserving the same order, we can draw our audience's attention
to the unused features. See Figure 9.19.


User satisfaction varies greatly by feature

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O

Responses based on survey question "How satisfied have you been with each of these features?".

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 


FIGURE 9.19 Unused features 


In Figure 9.19, it is easier to see the rank ordering (even though 
the categories aren't monotonically increasing from top to bottom) 
because of the alignment to a consistent baseline at the left of the 
graph. Here, we want our audience to focus mainly on the very bottom
feature in the graph: Feature O. Since we're trying to preserve
the established order and can't do this by putting it at the top (where 
the audience would encounter it first), the bold color and callout box 
help draw attention to the bottom of the graph. 
The preceding views show the progression I'd use in a live presentation.
The sparing and strategic use of color lets me direct my audience's
attention to one component of the data at a time. If you are
creating a written document to be shared directly with your audience,
you might compress all of these views into a single, comprehensive
visual, as shown in Figure 9.20.


User satisfaction varies greatly by feature

Product X User Satisfaction: Features

Legend: Have not used, Not satisfied at all, Not very satisfied, Somewhat satisfied, Very satisfied, Completely satisfied

Bars (top to bottom): Feature A through Feature O

Callouts:
Features A and B continue to top user satisfaction.
Users are least satisfied with Features J and N; what improvements can we make here for a better user experience?
Feature O is least used. What steps can we proactively take with existing users to increase utilization?


Responses based on survey question "How satisfied have you been with each of these features?". 

Need more details here to help put this data into context: How many people completed survey? What proportion of users does this represent? 
Do those who completed survey look like the overall population, demographic-wise? When was the survey conducted? 


FIGURE 9.20 Comprehensive visual


When I process Figure 9.20, my eyes do a number of zigzagging 
"z's" across the page. First, I see the bold "Features" in the graph 
title. Then I'm drawn to the dark blue bars—which I follow across to 
the dark blue text box that tells me what's interesting about what 
I'm looking at (you'll note my text here is mostly descriptive, mainly 
due to the anonymity of the example; ideally this space would be 
used to provide greater insight). Next, I hit the orange text box, read 
it, and glance back leftward to see the evidence in the graph that 
supports it. Finally, I see the teal bar emphasized at the bottom and 
look across to see the text that describes it. Strategic use of color 
sets the various series apart from one another while also making it 
clear where the audience should look for the specific evidence of 
what is being described in the text. 
Note that with Figure 9.20 it is harder for your audience to form other
conclusions with the data, since attention is drawn so strongly to the
particular points I want to highlight. But as we've discussed repeatedly,
once you've reached the point of needing to communicate,
there should be a specific story or point that you want to highlight,
rather than let your audience draw their own conclusions. Figure 9.20
is too dense for a live presentation but could work well for the document
that will be circulated.

I've mentioned this previously but would feel remiss not to point 
out that in some cases there is intrinsic order in the data you want 
to show (ordinal categories). For example, instead of features, if the 
categories were age ranges (0-9, 10-19, 20-29, etc.), you should keep
those categories in numerical order. This provides an important construct
for the audience to use as they interpret the information. Then
use the other methods of drawing attention (through color, position, 
callout boxes with text) to direct the audience's attention to where 
you want them to pay it. 

Bottom line: there should be logic in the order of the data you show. 


CASE STUDY 4: Strategies for avoiding the 
spaghetti graph 

While I very much enjoy food, I have a distaste for any chart type 
that has food in its title. My hatred of pie charts is well documented. 
Donuts are even worse. Here is another to add to the list: the spaghetti
graph.

If you aren't sure if you've seen a spaghetti graph before, I'll bet that 
you have. A spaghetti graph is a line graph where the lines overlap 
a lot, making it difficult to focus on a single series at a time. They 
look something like Figure 9.21. 
Types of non-profits supported by area funders

Legend: Arts & culture, Education, Health, Human services, Other

Y-axis: percent scale up to 100%



Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.21 The spaghetti graph 

Graphs like Figure 9.21 are known as spaghetti graphs because they 
look like someone took a handful of uncooked spaghetti noodles 
and threw them on the ground. And they are about as informative 
as those haphazard noodles would be as well ... 


which is to say ... 


not at all. 


Note how difficult it is to concentrate on a single line within that 
mess, due to all of the crisscrossing and because so much is competing
for your attention.

There are a few strategies for taking the would-be-spaghetti graph
and creating more visual sense of the data. I'll cover three such strategies
and show them applied in a couple of different ways to the
data graphed in Figure 9.21, which shows types of nonprofits supported
by funders in a given area. First, we'll look at an approach
you should be familiar with by now: using preattentive attributes to
emphasize a single line at a time. After that, we'll look at a couple of
views that separate the lines spatially. Then finally, we'll look at a combined
approach that leverages elements of these first two strategies.


Emphasize one line at a time 

One way to keep the spaghetti graph from becoming visually overwhelming
is to use preattentive attributes to draw attention to a single
line at a time. For example, we could focus our audience on the
increase in the percentage of funders donating over time to health 
nonprofits. See Figure 9.22. 


Types of non-profits supported by area funders 



Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.22 Emphasize a single line 

Or we could use the same strategy to emphasize the decrease in 
the percentage of funders donating to education-related nonprofits.
See Figure 9.23.
Types of non-profits supported by area funders



Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.23 Emphasize another single line 

In Figures 9.22 and 9.23, color, thickness of line, and added marks 
(the data marker and data label) act as visual cues to draw attention 
to where we want our audience to focus. This strategy can work well 
in a live presentation, where you explain the details of the graph 
once (as we've seen in the recent case studies), then cycle through 
the various data series in this manner, highlighting what is interesting
or should be paid attention to with each and why. Note that we
need either this voiceover or the addition of text to make it clear 
why we are highlighting the given data and provide the story for 
our audience. 
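
The idea of cycling the emphasis through the series can be sketched as a small style table. The series names come from the graph's legend; the colors, widths, and the `line_styles` helper are assumptions for illustration. The keys `color`, `linewidth`, and `zorder` are named after common line-styling options in plotting libraries, while `label_end_point` is a made-up flag standing in for "add the data marker and data label."

```python
SERIES = ["Arts & culture", "Education", "Health", "Human services", "Other"]


def line_styles(series, focus, accent="#1f77b4"):
    """Return one style dict per series, emphasizing only `focus`."""
    styles = {}
    for name in series:
        if name == focus:
            # Emphasized line: bold color, thicker stroke, drawn on top, labeled.
            styles[name] = {"color": accent, "linewidth": 3.0,
                            "zorder": 2, "label_end_point": True}
        else:
            # Background lines: thin grey, drawn underneath, unlabeled.
            styles[name] = {"color": "#c8c8c8", "linewidth": 1.0,
                            "zorder": 1, "label_end_point": False}
    return styles


# Build one view per series to step through in a live presentation.
views = {focus: line_styles(SERIES, focus) for focus in SERIES}
print(views["Health"]["Health"])
print(views["Health"]["Education"])
```

Each view still needs its own sentence of narration or on-slide text; the styling says where to look, not why.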


Separate spatially 

We can untangle the spaghetti graph by pulling the lines apart either 
vertically or horizontally. First, let's look at a version where the lines 
are pulled apart vertically. See Figure 9.24. 
Types of non-profits supported by area funders

X-axis (top): 2010 through 2015; label column: % of funders

Panels (top to bottom): Health, Education, Human services, Arts & culture, Other

End-point labels (as extracted, in order): 67%, 60%, 55%, 43%, 30%


Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.24 Pull the lines apart vertically 

In Figure 9.24, the same x-axis (year, shown at the top) is leveraged 
across all of the graphs. In this solution, I've created five separate 
graphs but organized them such that they appear to be a single 
visual. The y-axis within each graph isn't shown; rather, the starting 
and ending point labels are meant to provide enough context so 
that the axis is unnecessary. Though they aren't shown, it is important 
that the y-axis minimum and maximum are the same for each graph 
so the audience can compare the relative position of each line or 
point within the given space. If you were to shrink these down, they 
would look similar to what Edward Tufte calls "sparklines" (a very 
small line graph typically drawn without axis or coordinates to show 
the general shape of the data; Beautiful Evidence, 2006). 
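
The requirement that every panel share the same y-axis minimum and maximum can be enforced by computing the limits once from all of the series. The sketch below uses made-up values and a hypothetical `shared_limits` helper; the same pair of limits would then be applied to each small graph.

```python
# Hypothetical series (percent of funders per year); not the data in the figure.
series_by_name = {
    "Category 1": [50, 55, 61],
    "Category 2": [72, 66, 58],
    "Category 3": [20, 24, 31],
}


def shared_limits(series_by_name, pad=0.0):
    """Return one (minimum, maximum) pair covering every series, with optional padding."""
    values = [value for series in series_by_name.values() for value in series]
    return min(values) - pad, max(values) + pad


y_min, y_max = shared_limits(series_by_name, pad=5)
print(y_min, y_max)  # every panel uses these same limits
```

If each panel were left to pick its own limits, a small change in one category could look as dramatic as a large change in another.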

This approach assumes that being able to see the trend for a given 
category (Health, Education, etc.) is more important than comparing 
the values across categories. If that isn't the case, we can consider
pulling the data apart horizontally, as illustrated in Figure 9.25. 


Types of non-profits supported by area funders 



Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.25 Pull the lines apart horizontally 


Whereas in Figure 9.24 we leveraged the x-axis (years) across the five 
categories, in Figure 9.25 we leverage the same y-axis (percent of 
funders) across the five categories. Here, the relative height of the 
various data series allows them to more easily be compared with each 
other. We can quickly see that the highest percentage of funders in 
2015 donate to Health, a lower percentage to Education, an even 
lower percentage to Human Services, and so on. 


Combined approach 

Another option is to combine the approaches we've outlined so 
far. We can separate spatially and emphasize a single line at a time, 
while leaving the others there for comparison but pushing them to 
the background. As was the case with the prior approach, we can 
do this by separating the lines vertically (Figure 9.26) or horizontally 
(Figure 9.27). 
Types of non-profits supported by area funders

X-axis (top): 2010 through 2015; label column: % of funders

Panels visible in the extracted text: Health, Education

End-point label (as extracted): 67%



Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.26 Combined approach, with vertical separation 


Types of non-profits supported by area funders

Panels: Health, Education, Human services, Arts & culture, Other

X-axis in each panel: 2010 to 2015




Data is self-reported by funders; percents sum to greater than 100 because respondents can make multiple selections. 

FIGURE 9.27 Combined approach, with horizontal separation 
Having a number of small graphs together, as shown in Figure 9.27,
is sometimes referred to as "small multiples." As noted previously, 
it's imperative here that the details of each graph (the x- and y-axis 
minimum and maximum) are the same so that the audience can 
quickly compare the highlighted series across the various graphs. 

This approach, shown in Figures 9.26 and 9.27, can work well if the 
context of the full dataset is important but you want to be able to 
focus on a single line at a time. Because of the denseness of information,
this combined approach may work better for a report or
presentation that will be circulated rather than a live presentation, 
where it will be more challenging to direct your audience where you 
want them to look. 

As is frequently the case, there is not a single "right" answer. Rather, 
the solution that will work best will vary by situation. The meta-lesson 
is: if you find yourself facing a spaghetti graph, don't stop there. Think 
about what information you most want to convey, what story you
want to tell, and what changes to the visual could help you accomplish
that effectively. Note that in some cases, this may mean showing
less data altogether. Ask yourself: Do I need all categories? All
years? When appropriate, reducing the amount of data shown can
make the challenge of graphing data like that shown in this example
easier as well.

CASE STUDY 5: Alternatives to pies 

Recall the scenario we discussed in Chapter 1 about the summer 
learning program on science. To refresh your memory: you just completed
a pilot summer program on science aimed at improving perceptions
of the field among 2nd and 3rd grade elementary children.
You conducted a survey going into the program and at the end of 
the program, and want to use this data as evidence of the success 
of the pilot program in your request for future funding. Figure 9.28 
shows a first attempt at graphing this data. 
Survey results: summer learning program on science

PRE: How do you feel about doing science?
POST: How do you feel about doing science?

Legend (both pies): Bored, Not great, OK, Kind of interested, Excited



FIGURE 9.28 Original visual 

The survey data demonstrates that, on the basis of improved sentiment
toward science, the pilot program was a great success. Going
into the program, the biggest segment of students (40%, the green
slice in Figure 9.28, left) felt just "OK" about science; perhaps they
hadn't made up their minds one way or the other. However, after the
program (Figure 9.28, right), we see the 40% in green shrinks down
to 14%. "Bored" (blue) and "Not great" (red) went up a percentage 
point each, but the majority of the change was in a positive direction. 
After the program, nearly 70% of kids (purple plus teal segments) 
expressed some level of interest toward science. 

Figure 9.28 does this story a great disservice. I shared my less-than-favorable
view on pie charts in Chapter 2, so I hope this judgment
is not met with surprise. Yes, you can get to the story from
Figure 9.28, but you have to work for it and overcome the annoyance 
of trying to compare segments across two pies. As we've discussed, 
we want to limit or eliminate the work your audience has to do to 
get at the information, and we certainly don't want to annoy them. 
We can avoid such challenges by choosing a different type of visual. 

Let's take a look at four alternatives for displaying this data—show 
the numbers directly, simple bar graph, stacked horizontal bar graph, 
and slopegraph—and discuss some considerations with each. 
Alternative #1: show the numbers directly

If the improvement in positive sentiment is the main message we 
want to impart to our audience, we can consider making that the 
only thing we communicate. See Figure 9.29. 


Pilot program was a success 


After the pilot program, 

68%

of kids expressed interest towards science, 

compared to 44% going into the program. 

Based on survey of 100 students conducted before and after pilot program (100% response rate on both surveys). 

FIGURE 9.29 Show the numbers directly 

Too often, we think we have to include all of the data and overlook 
the simplicity and power of communicating with just one or two numbers
directly, as demonstrated in Figure 9.29. That said, if you feel
you need to show more, look to one of the following alternatives. 
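
As a quick check on the headline numbers, the arithmetic can be worked through directly. The sketch below uses only percentages stated in the text (40% and 14% for "OK," 44% and 68% for students expressing interest) and derives the remaining negative share by subtraction; the grouping into three buckets and the `with_remainder` helper are assumptions made for illustration.

```python
# Percent of students, as stated in the text. "interested" combines
# "Kind of interested" and "Excited".
before = {"OK": 40, "interested": 44}
after = {"OK": 14, "interested": 68}


def with_remainder(groups):
    """Add a 'negative' bucket holding whatever is left of 100%."""
    result = dict(groups)
    result["negative"] = 100 - sum(groups.values())
    return result


before = with_remainder(before)
after = with_remainder(after)

# Change in percentage points from before to after the program.
change = {group: after[group] - before[group] for group in before}
print(change)
```

The two-point rise in the negative bucket is consistent with "Bored" and "Not great" each going up by one percentage point, while the 24-point rise in interest is the number the single-statistic view puts front and center.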


Alternative #2: simple bar graph 

When you want to compare two things, you should generally put 
those two things as close together as possible and align them along 
a common baseline to make this comparison easy. The simple bar 
graph does this by aligning the Before and After survey responses 
with a consistent baseline at the bottom of the graph. See Figure 9.30. 
How do you feel about science?

Callouts:
BEFORE program, the majority of children felt just OK about science.
AFTER program, more children were Kind of interested & Excited about science.

Data labels recovered: 11%, 5%

X-axis categories: Bored, Not great, OK, Kind of interested, Excited


Based on survey of 100 students conducted before and after pilot program (100% response rate on both surveys). 


FIGURE 9.30 Simple bar graph 

I am partial to this view for this specific example because the layout 
makes it possible to put the text boxes right next to the data points 
they describe (note that other data is there for context but is slightly 
pushed to the background through the use of lighter colors). Also, 
by having Before and After as the primary classification, I'm able to 
limit the visual to two colors—grey and blue—whereas three colors 
will be used in the following alternatives. 

Alternative #3: 100% stacked horizontal bar graph 

When the part-to-whole concept is important (something you don't 
get with either Alternative #1 or #2), the stacked 100% horizontal 
bar graph achieves this. See Figure 9.31. Here, you get a consistent 
baseline to use for comparison at the left and at the right of the 
graph. This allows the audience to easily compare both the negative 
segments at the left and the positive segments at the right across 
the two bars and, because of this, is a useful way to visualize survey 
data in general. 
Pilot program was a success

How do you feel about science?

Legend: Bored, Not great, OK, Kind of interested, Excited

X-axis (top): % of total, 0% to 100% in steps of 20%

Bars: BEFORE, AFTER

Callouts:
BEFORE program, the majority of children (40%) felt just OK about science.
AFTER program, more children were Kind of interested (30%) & Excited (38%) about science.

Based on survey of 100 students conducted before and after pilot program (100% response rate on both surveys). 

FIGURE 9.31 100% stacked horizontal bar graph 

In Figure 9.31, I chose to retain the x-axis labels rather than put data
labels on the bars directly. I tend to do it this way when leveraging 
100% stacked bars so that you can use the scale at the top to read 
either from left to right or from right to left. In this case, it allows 
us to attribute numbers to the change from Before to After on the 
negative end of the scale ("Bored" and "Not great") or from right 
to left, doing the same for the positive end of the scale ("Kind of 
interested" and "Excited"). In the simple bar graph shown previously 
(Figure 9.30), I chose to omit the axis and label the bars directly. This 
illustrates how different views of your data may lead you to different 
design choices. Always think about how you want your audience to 
use the graph and make your design choices accordingly: different 
choices will make sense in different situations. 
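
As a rough illustration of reading a 100% stacked bar from either end, the following Python sketch computes where each segment starts and ends on the 0% to 100% axis. Only the 40%, 30%, and 38% shares are quoted in the figure text; the remaining shares, and the helper names `segment_bounds`, `negative_end`, and `positive_end`, are assumptions made for this example rather than values or tools from the original analysis.

```python
# Shares in percent, ordered from the negative to the positive end of the scale.
# Only OK before (40) and Kind of interested (30) / Excited (38) after are
# quoted in the text; every other number is an illustrative placeholder.
CATEGORIES = ["Bored", "Not great", "OK", "Kind of interested", "Excited"]
BEFORE = [11, 5, 40, 25, 19]
AFTER = [12, 6, 14, 30, 38]


def segment_bounds(shares):
    """Return (start, end) positions on the 0-100% axis for each segment."""
    total = sum(shares)
    bounds = []
    start = 0.0
    for share in shares:
        end = start + 100.0 * share / total
        bounds.append((start, end))
        start = end
    return bounds


def negative_end(shares, n=2):
    """Reading left to right: where the first n (negative) segments end."""
    return segment_bounds(shares)[n - 1][1]


def positive_end(shares, n=2):
    """Reading right to left: combined width of the last n (positive) segments."""
    return 100.0 - segment_bounds(shares)[-n][0]


if __name__ == "__main__":
    for label, shares in (("BEFORE", BEFORE), ("AFTER", AFTER)):
        print(label, "negative:", negative_end(shares),
              "positive:", positive_end(shares))
```

With these placeholder shares, the negative end moves from 16% to 18% and the positive end from 44% to 68%. Both readings are measured from a fixed edge of the bar, which is what makes the two bars directly comparable.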



Alternative #4: slopegraph 

The final alternative I'll present here is a slopegraph. As was the case 
with the simple bar chart, you don't get a clear sense of there being 
a whole and thus pieces-of-a-whole in this view (in the way that you 
do with the initial pie or with the 100% horizontal stacked bar). Also,
if it is important to have your categories ordered in a certain way, a 
slopegraph won't always be ideal since the various categories are 
placed according to the respective data values. In Figure 9.32 on the 
right-hand side, you do get the positive end of the scale at the top, 
but note that "Bored" and "Not great" at the bottom are switched 
relative to how they'd appear in an ordinal scale because of the values
that correspond with these points. If you need to dictate the
category order, use the simple bar graph or the 100% stacked bar 
graph, where you can control this. 


[Figure 9.32 slopegraph. Title: Pilot program was a success. Subtitle: How do you feel about science? Columns: BEFORE, AFTER. Annotations: "BEFORE program, the majority of children felt just OK about science." and "more children were Kind of interested & Excited about science."]

Based on survey of 100 students conducted before and after pilot program (100% response rate on both surveys).

FIGURE 9.32 Slopegraph 


With the slopegraph in Figure 9.32, you can easily see the visual percentage
change from Before to After for each category via the slope
of the respective line. It's easy to see quickly that the category that 
increased the most was "Excited" (due to the steep slope) and the 
category that decreased markedly was "OK." The slopegraph also 
provides clear visual ordering of categories from greatest to least 
(via their respective points in space from top to bottom on the left 
and right sides of the graph). 
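
To make the two readings of a slopegraph concrete (slope as change, vertical position as rank), here is a minimal Python sketch. The shares are illustrative: only OK before (40%) and Kind of interested (30%) and Excited (38%) after are quoted in the figure text, and the function names `changes` and `top_to_bottom` are hypothetical helpers introduced for this example.

```python
# Illustrative survey shares in percent, keyed by category.
# Only OK before (40) and Kind of interested (30) / Excited (38) after
# come from the text; the other numbers are placeholders.
BEFORE = {"Bored": 11, "Not great": 5, "OK": 40,
          "Kind of interested": 25, "Excited": 19}
AFTER = {"Bored": 12, "Not great": 6, "OK": 14,
         "Kind of interested": 30, "Excited": 38}


def changes(before, after):
    """Percentage-point change per category (the slope of each line)."""
    return {name: after[name] - before[name] for name in before}


def top_to_bottom(values):
    """Category order on one side of the slopegraph, largest value first."""
    return sorted(values, key=values.get, reverse=True)


if __name__ == "__main__":
    delta = changes(BEFORE, AFTER)
    print("Steepest rise:", max(delta, key=delta.get))
    print("Steepest fall:", min(delta, key=delta.get))
    print("Right-hand order:", top_to_bottom(AFTER))
```

With these placeholder values, "Bored" lands above "Not great" on the right-hand side, which is the switched ordering described earlier: position follows the data value, not the ordinal scale.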

Any of these alternatives might be the best choice given the specific
situation, how you want your audience to interact with the information,
and what point or points of emphasis you want to make. The
big lesson here is that you have a number of alternatives to pies that 
can be more effective for getting your point across. 


In closing 

In this chapter, we discussed considerations and solutions for tackling
several common challenges faced when communicating visually
with data. Inevitably, you'll face data visualization challenges that I 
have not addressed. There is as much to be learned from the critical 
thinking that goes into solving some of these scenarios as there is 
from the "answer" itself. As we've discussed, when it comes to data 
visualization, rarely is there a single correct path or solution. 


Even more examples 


For more case studies like the ones we've considered here,
check out my blog at storytellingwithdata.com, where 
you'll find a number of before-and-after examples leveraging 
the lessons that we've learned. 


When you find yourself in a situation where you are unsure how to 
proceed, I nearly always recommend the same strategy: pause to 
consider your audience. What do you need them to know or do? 
What story do you aim to tell them? Often, by answering these questions,
a good path for how to present your data will become clear.
If one doesn't, try several views and seek feedback. 

My challenge to you is to consider how you can apply all of the lessons
we've learned and your critical thinking skills to the various and
varied data visualization challenges you face. The responsibility— 
and the opportunity—to tell a story with data is yours. 


final thoughts


Data visualization—and communicating with data in general—sits at 
the intersection of science and art. There is certainly some science 
to it: best practices and guidelines to follow, as we've discussed 
throughout this book. But there is also an artistic component. This 
is one of the reasons this area is so much fun. It is inherently diverse. 
Different people will approach things in varying ways and come up 
with distinct solutions to the same data visualization challenge. As 
we've discussed, there is no single "right" answer. Rather, there are 
often multiple potential paths for communicating effectively with 
data. Apply the lessons we've covered in this book to forge your 
path, with the goal of using your artistic license to make the information
easier for your audience to understand.

You have learned a great deal over the course of this book that sets 
you up for success when it comes to communicating effectively with 
data. In this final chapter, we'll discuss some tips on where to go from 
here and strategies for upskilling storytelling with data competency 
in your team and organization. Finally, we will end with a recap of 
the main lessons we've covered and send you off eager and ready
to tell stories with data. 


Where to go from here 

Reading about effective storytelling with data is one thing. But how 
do you translate what we've learned to practical application? The simple
way to get good at this is to do it: practice, practice, and practice
some more. Look for opportunities in your work to apply the lessons
we've learned. Note that it doesn't have to be all or nothing—one 
way to make progress is through incremental improvements to existing
or ongoing work. Consider also when you can leverage the entire
storytelling with data process that we've covered from start to finish. 


Now I want to overhaul our entire monthly report! 


You likely see graphs differently than you did at the onset
of our journey together. Rethinking the way you visualize
data is a great thing. But don't let overambitious goals
overwhelm and hinder progress. Consider what incremental 
improvements you can make as you work toward storytelling 
with data nirvana. For example, if you're considering overhauling
your regular reports, an interim step could be to start
thinking of the report as the appendix. Leave the data there 
for reference, but push it to the back so it doesn't distract 
from the main message. Insert a few slides or a cover note up 
front and use this to pull out the interesting stories, leveraging
the storytelling with data lessons we've covered. This way
you can more easily focus your audience on the important 
stories and resulting actions. 


For some specific, concrete steps on where to go from here, I'll outline
five final tips: learn your tools well, iterate and seek feedback,
allow ample time for this part of the process, seek inspiration from 
others, and, last but not least, have some fun while you're at it!
Let's discuss each of these. 


Tip #1: learn your tools well 

For the most part, I've intentionally avoided discussion on tools 
because the lessons we've covered are fundamental and can be 
applied to varying degrees in any tool (for example, Excel or Tableau).
Try not to let your tools be a limiting factor when it comes to communicating
effectively with data. Pick one and get to know it as best
you can. When you're first starting out, a course to become familiar 
with the basics may be helpful. In my experience, however, the best 
way to learn a tool is to use it. When you can't figure out how to do 
something, don't give up. Continue to play with the program and 
search Google for solutions. Any frustration you encounter will be 
worth it when you can bend your tool to your will! 

You don't need fancy tools in order to visualize data well. The examples
we've looked at in this book were all created with Microsoft
Excel, which I find is the most pervasive when it comes to business 
analytics. 

While I use mainly Excel for visualizing data, this isn't your only option.
There are a plethora of tools out there. The following is a very quick 
rundown of some of the popular ones currently used for creating 
data visualizations like the ones we've examined: 

• Google spreadsheets are free, online, and sharable, allowing 
multiple people to edit (as of this writing, there remain graph formatting
constraints that make it challenging to apply some of the
lessons we've covered when it comes to decluttering and drawing 
attention where you want it). 

• Tableau is a popular out-of-the-box data visualization solution 
that can be great for exploratory analysis because it allows you to 
quickly create multiple views and nice-looking graphs from your 
data. It can be leveraged for the explanatory via the Story Points 
feature. It is expensive, though a free Tableau Public option is
available if uploading your data to a public server isn't an issue. 

• Programming languages such as R, D3 (JavaScript), Processing, and
Python have a steeper learning curve but allow for greater flexibility,
since you can control the specific elements of the graphs you
create and make those specifications repeatable through code. 

• Some people use Adobe Illustrator, either alone or together with
graphs created in an application like Excel or via a programming
language, for easier manipulation of graph elements and a professional
look and feel.


How I use PowerPoint 


For me, PowerPoint is simply the mechanism that allows
me to organize a handout or present on the big screen. 

I nearly always start from a totally blank slide and do not 
leverage the built-in bullets that too easily turn content from 
presentation to teleprompter. 

You can build graphs directly in PowerPoint; however, I tend 
not to do this. There is greater flexibility in Excel (where, in 
addition to the graph, you can also have some elements of 
a visual—for example, titles or axis labels—directly in the 
cells, which is sometimes useful). Because of this, I create my 
visuals in Excel, then copy and paste into PowerPoint as an 
image. If I am using text together with a visual—for example, 
to draw attention to a specific point—I typically do that via a 
text box in PowerPoint. 

The animation feature within PowerPoint can be useful for 
progressing through a story with iterations of the same visual, 
as shown in Chapter 8 or some of the case studies in Chapter 9.
When using animation in PowerPoint, use only simple
Appear or Disappear (in some instances, Transparency can
also be useful); steer clear of any animation that causes elements
to fly in or fade out. This is the presentation software
equivalent of 3D graphs: unnecessary and distracting!

Another essential basic tool for visualizing data that I did not include
in the preceding list is paper, which brings me to my next tip.


Tip #2: iterate and seek feedback 

I've presented the storytelling with data process as a linear path. 
That's not often the case in reality. Rather, it takes iterating to get 
from early ideas to a final solution. When the best course for visualizing
certain data is unclear, start with a blank piece of paper. This
enables you to brainstorm without the constraints of your tools or 
what you know how to do in your tools. Sketch out potential views 
to see them side-by-side and determine what will work best for 
getting your message across to your audience. I find that we form 
less attachment to our work product (which can make iterating
easier) when we are working on paper rather than on our computers.
There is also something freeing about drawing on blank paper
that may make it easier to identify new approaches if you're feeling 
stuck. Once you have your basic approach sketched, consider what 
you have at your disposal—tools, or internal or external experts—to 
actually create the visual. 

When creating your visual in your graphing application (for example, 
Excel) and refining to get from good to great, you can leverage what 
I call the "optometrist approach." Create a version of the graph (let's 
call it A), then make a copy of it (B) and make a single change. Then 
determine which looks better: A or B. Often, the practice of seeing
slight variations next to each other makes it quickly clear which
view is superior. Progress in this manner, preserving the latest "best" 
visual and continuing to make minor modifications in a copy (so you 
always have the prior version to go back to in case the modification 
worsens it) to iterate toward your ideal visual. 

At any point, if the best path is unclear, seek feedback. The fresh set 
of eyes that a friend or colleague can bring to the data visualization 
endeavor is invaluable. Show someone else your visual and have 
them talk you through their thought process: what they pay attention
to, what observations they make, what questions they have,
and any ideas they may have for better getting your point across.
These insights will let you know if the visual you've created is on the 
mark or, in the case when it isn't, give you an idea of where to make 
changes and focus continued iteration. 

When it comes to iterating, there is one thing you need perhaps 
more than anything else in order to be successful: time. 


Tip #3: devote time to storytelling with data 

Everything we've discussed throughout this book takes time. It takes 
time to build a robust understanding of the context, time to understand
what motivates our audience, time to craft the 3-minute story
and form the Big Idea. It takes time to look at the data in different 
ways and determine how to best show it. It takes time to declutter 
and draw attention and iterate and seek feedback and iterate some 
more to create an effective visual. It takes time to pull it all together 
into a story and form a cohesive and captivating narrative. 

It takes even more time to do all of this well. 

One of my biggest tips for success in storytelling with data is to allow
adequate time for it. If we don't consciously recognize that this takes 
time to do well and budget accordingly, our time can be entirely 
eaten up by the other parts of the analytical process. Consider the 
typical analytical process: you start with a question or hypothesis, 
then you collect the data, then you clean the data, and then you analyze
the data. After all of that, it can be tempting to simply throw the
data into a graph and call it "done." 

But we simply aren't doing ourselves—or our data—justice with this 
approach. The default settings of our graphing application are typically
far from ideal. Our tools do not know the story we aim to tell.
Combine these two things and you run the risk of losing a great deal 
of potential value—including the opportunity to drive action and 
effect change—if adequate time isn't spent on this final step in the 
analytical process: the communication step. This is the only part of 
the entire process that your audience actually sees. Devote time to
this important step. Expect it to take longer than you think, and allow
sufficient time to iterate and get it right.


Tip #4: seek inspiration through good examples 

Imitation really is the best form of flattery. If you see a data visualization
or example of storytelling with data that you like, consider how
you might adapt the approach for your own use. Pause to reflect on 
what makes it effective. Make a copy of it and create a visual library 
that you can add to over time and refer to for inspiration. Emulate 
the good examples and approaches that you see. 

Said more provocatively—imitation is a good thing. We learn by 
emulating experts. That's why you see people with their sketchpads 
and easels at art museums—they are interpreting great works. My 
husband tells me that while learning to play the jazz saxophone, he 
would listen to the masters repeatedly, narrowing at times to a single
measure played at a slower speed that he would practice until
he could repeat the notes perfectly. This idea of using great examples
as an archetype to learn applies to data visualization as well.

There are a number of great blogs and resources on the topic of data 
visualization and communicating with data that contain many good 
examples. Here are a few of my current personal favorites (including
my own!):

• Eager Eyes (eagereyes.org, Robert Kosara): Thoughtful content 
on data visualization and visual storytelling. 

• FiveThirtyEight's Data Lab (fivethirtyeight.com/datalab, various 
authors): I like their typically minimalist graphing style on a large 
range of news and current events topics. 

• Flowing Data (flowingdata.com, Nathan Yau): Membership gets 
you premium content, but there are a lot of great free examples 
of data visualization as well. 

• The Functional Art (thefunctionalart.com, Alberto Cairo): An introduction
to information graphics and visualization, with great concise
posts highlighting advice and examples.

• The Guardian Data Blog (theguardian.com/data, various authors): 
News-related data, often with accompanying article and visualizations,
by the British news outlet.

• HelpMeViz (HelpMeViz.com, Jon Schwabish): "Helping people 
with everyday visualizations," this site allows you to submit a visual 
to receive feedback from readers or scan the archives for examples
and corresponding conversations.

• Junk Charts (junkcharts.typepad.com, Kaiser Fung): By 
self-proclaimed "web's first data viz critic," focuses on what makes 
graphics work and how to make them better. 

• Make a Powerful Point (makeapowerfulpoint.com, Gavin 
McMahon): Fun, easy-to-digest content on creating and giving 
presentations and presenting data. 

• Perceptual Edge (perceptualedge.com, Stephen Few): No-nonsense
content on data visualization for sensemaking and
communication. 

• Visualising Data (visualisingdata.com, Andy Kirk): Charts the 
development of the data visualization field, with great monthly 
"best visualisations of the web" resource list. 

• VizWiz (vizwiz.blogspot.com, Andy Kriebel): Data visualization 
best practices, methods for improving existing work, and tips and 
tricks for using Tableau Software. 

• storytelling with data (storytellingwithdata.com): My blog focuses 
on communicating effectively with data and contains many examples,
visual makeovers, and ongoing dialogue.

This is just a sampling. There is a lot of great content out there. I
continue to learn from others who are active in this space and doing
great work. You can, too!


Learn from the not-so-great examples, too


Often, you can learn as much from the poor examples of
data visualization—what not to do—as you can from 
those that are effective. Bad graphs are so plentiful that 
entire sites exist to curate, critique, and poke fun at them. 

For an entertaining example, check out WTF Visualizations 
(wtfviz.net), where content is described simply as "visualizations
that make no sense." I challenge you not only to
recognize when you encounter a poor example of data visualization
but also to pause and reflect on why it isn't ideal and
how it could be improved. 


You now have a discerning eye when it comes to the visual display 
of information. You will never look at a graph the same. One work¬ 
shop attendee told me that he is "ruined"—he can't encounter a 
data visualization without applying his new lens for assessing effectiveness.
I love hearing these stories, as it means I'm making progress
toward my goal of ridding the world of ineffective graphs. You have 
been ruined in this same way, but this is actually a really good thing! 
Continue to learn from and leverage the aspects of good examples 
you see, while avoiding the pitfalls of the poor ones, as you start to 
create your own data visualization style. 


Tip #5: have fun and find your style 

When most people think about data, one of the furthest things from
their mind is creativity. But within data visualization, there is absolutely
space for creativity to play a role. Data can be made to be breathtakingly
beautiful. Don't be afraid to try new approaches and play a
little. You'll continue to learn what works and what doesn't over time.

You may also find that you develop a personal data visualization style. 
For example, my husband says he can recognize visuals that I created
or influenced. Unless a client brand calls for something else, I
tend to do everything in shades of grey and use blue sparingly in a
minimalist style, almost always in plain old Arial font (I like it!). That
doesn't mean your approach must imitate these specifics to be successful.
My own style has evolved based on personal preferences and
learning through trial and error—testing out different fonts, colors, 
and graph elements. I can recall one particularly unfortunate example 
that incorporated a grey-to-white shaded graph background and far 
too many shades of orange. I've come a long way! 

To the extent that it makes sense given the task at hand, don't be 
afraid to let your own style develop and creativity come through when 
you communicate with data. Company brand can also play a role in 
developing a data visualization style; consider your company's brand 
and whether there are opportunities to fold that into how you visualize
and communicate with data. Just make sure that your approach
and stylistic elements are making the information easier—not more 
difficult—for your audience to consume. 

Now that we've looked at some specific tips for you to follow, let's 
turn to some ideas for building storytelling with data competency 
in others. 


Building storytelling with data competency in your 
team or organization 

I am a strong believer that anyone can improve their ability to communicate
with data by learning and applying the lessons we've covered.
That said, some will have more interest and natural aptitude
than others in this space. When it comes to being effective at communicating
with data in your team or your organization, there are a
few potential strategies to consider: upskill everyone, invest in an 
expert, or outsource this part of the process. Let's briefly discuss 
each of these. 


Upskill everyone

As we've discussed, part of the challenge is that data visualization 
is a single step in the analytical process. Those hired into analytical 
roles typically have quantitative backgrounds that suit them well for 
the other steps (finding the data, pulling it together, analyzing it, 
building models), but not necessarily any formal training in design 
to help them when it comes to the communication of the analysis. 
Also, increasingly those without analytical backgrounds are being 
asked to put on analytical hats and communicate using data. 

For both of these groups, finding ways to impart foundational knowledge
can make everyone better. Invest in training or use the lessons
covered here to generate momentum. On this latter note, here are 
some specific ideas: 

• Storytelling with data book club: read a chapter at a time and 
then discuss it together, identifying examples specific to your work
where the given lesson can be applied. 

• Do-it-yourself workshop: after finishing the book, conduct your 
own workshop—soliciting examples of communicating with data 
from your team and discussing how they can be improved. 

• Makeover Monday: challenge individuals to a weekly makeover 
of less-than-ideal examples employing the lessons we've covered. 

• Feedback loop: set the expectation that individuals must share 
work in progress and offer feedback to each other grounded in 
the storytelling with data lessons. 

• And the winner is: introduce a monthly or quarterly contest, where 
individuals or teams can submit their own examples of effective 
storytelling with data, then start a gallery of model examples,
adding to it over time via contest winners. 

Any of these approaches—alone or combined—can create and help 
ensure continued focus on effective visualization and storytelling 
with data. 


Invest in an internal expert or two

Another approach is to identify an individual or a couple of individuals
on your team or in your organization who are interested in data visualization
(even better if they've already displayed some natural aptitude)
and invest in them so they can become your in-house experts.
Make it an expectation of their role to be an internal data visualization 
consultant to whom others on the team can turn for brainstorming 
and feedback or to overcome tool-specific challenges. This investment
can take the form of books, tools, coaching, workshops, or
courses. Provide time and opportunities to learn and practice. This 
can be a great form of recognition and career development for the 
individual. As the individual continues to learn, they can share this 
with others as a way to ensure continued team development as well. 


Outsource 

In some situations, it may make sense to outsource visual creation 
to an external expert. If time or skill constraints are too great to 
overcome for a specific need, turning to a data visualization or presentation
consultant may be worth considering. For example, one
client contracted me to design an important presentation that they 
would need to give a number of times in the upcoming year. Once 
the basic story was in place, they knew they could make the minor 
changes needed to make it fit the various venues. 

The biggest drawback of outsourcing is that you don't develop the 
skills and learn in the same way as if you tackle the challenge internally.
To help overcome this, look for opportunities to learn from the
consultant during the process. Consider whether the output can also
provide a starting point for other work, or if it can be evolved over 
time as you develop internal capability. 


A combined approach 

The teams and organizations I've seen become the most successful
in this space leverage a combined approach. They recognize
the importance of storytelling with data and invest in training and
practice to give everyone the foundational knowledge for effective 
data visualization. They also identify and support an internal expert, 
to whom the rest of the team can turn for help overcoming specific 
challenges. They bring in external experts to learn from as makes 
sense. They recognize the value of being able to tell stories with 
data effectively and invest in their people to build this competency. 

Through this book, I've given you the foundational knowledge and 
language to use to help your team and your organization excel 
when it comes to communicating with data. Think about how you 
can frame feedback in terms of the lessons we've covered to help 
others improve their ability and effectiveness as well. 

Let's wrap up with a recap of the path we've taken to effective storytelling
with data.


Recap: a quick look at all we've learned 

We have learned a great deal over the course of this book, from 
context to cutting clutter and drawing attention to telling a robust 
story. We've worn our designer hats and looked at things through our 
audience's eyes. Here is a review of the main lessons we've covered: 

1. Understand the context. Build a clear understanding of who 
you are communicating to, what you need them to know or do, 
how you will communicate to them, and what data you have to 
back up your case. Employ concepts like the 3-minute story, the 
Big Idea, and storyboarding to articulate your story and plan the 
desired content and flow. 

2. Choose an appropriate visual display. When highlighting a 
number or two, simple text is best. Line charts are usually best 
for continuous data. Bar charts work great for categorical data 
and must have a zero baseline. Let the relationship you want to 
show guide the type of chart you choose. Avoid pies, donuts, 3D, 
and secondary y-axes due to difficulty of visual interpretation. 

3. Eliminate clutter. Identify elements that don't add informative
value and remove them from your visuals. Leverage the Gestalt 
principles to understand how people see and identify candidates 
for elimination. Use contrast strategically. Employ alignment of 
elements and maintain white space to help make the interpretation
of your visuals a comfortable experience for your audience.

4. Focus attention where you want it. Employ the power of preattentive
attributes like color, size, and position to signal what's
important. Use these strategic attributes to draw attention to 
where you want your audience to look and guide your audience 
through your visual. Evaluate the effectiveness of preattentive 
attributes in your visual by applying the "where are your eyes 
drawn?" test. 

5. Think like a designer. Offer your audience visual affordances as 
cues for how to interact with your communication: highlight the 
important stuff, eliminate distractions, and create a visual hierarchy
of information. Make your designs accessible by not overcomplicating
and by leveraging text to label and explain. Increase
your audience's tolerance of design issues by making your visuals
aesthetically pleasing. Work to gain audience acceptance of
your visual designs. 

6. Tell a story. Craft a story with clear beginning (plot), middle 
(twists), and end (call to action). Leverage conflict and tension 
to grab and maintain your audience's attention. Consider the 
order and manner of your narrative. Utilize the power of repetition
to help your stories stick. Employ tactics like vertical and
horizontal logic, reverse storyboarding, and seeking a fresh perspective
to ensure that your story comes across clearly in your
communication. 

Together, these lessons set you up for success when communicating
with data.


In closing

When you opened this book, if you felt any sense of discomfort or 
lack of expertise when it comes to communicating with data, my 
hope is that those feelings have been mitigated. You now have a 
solid foundation, examples to emulate, and concrete steps to take 
to overcome the data visualization challenges you face. You have a 
new perspective. You will never look at data visualization the same. 
You are ready to assist me with my goal of ridding the world of ineffective
graphs.

There is a story in your data. If you weren't convinced of that before 
our journey together, I hope you are now. Use the lessons we've 
covered to make that story clear to your audience. Help drive better
decision making and motivate your audience to act. Never again
will you simply show data. Rather, you will create visualizations that 
are thoughtfully designed to impart information and incite action. 

Go forth and tell your stories with data! 

introduction

I often receive emails from people who have read my first book, storytelling with 
data, or attended one of our workshops by the same name. There are notes of encouragement,
support for the work we're doing, and plenty of questions and requests.
I especially love hearing the success stories: reports of having influenced a
key business decision, spurred an overdue budget conversation, or prompted an 
action that positively impacted an organization's bottom line. The most inspiring 
accounts are those of personal growth and recognition. One grateful reader applied
storytelling with data principles during an interview, helping him land a new
job. All of this success is the result of people from different industries, functions, 
and roles committing time to improve their ability to communicate with data. 

I also hear regularly from people who want more. They've read the book and 
understand the potential impact of telling stories with data, but struggle with the 
practical application to their own work. They have additional questions or feel 
they are facing nuanced situations that are keeping them from having the desired 
impact. It's clear that people crave more guidance and practice to help fully develop
their data storytelling skills.

Others reach out who are—or would like to be—teaching the lessons outlined in 
storytelling with data. In many cases, they are university instructors (it's amazing to 
think that storytelling with data is used as a textbook at more than 100 universities 
around the world!) or they are a part of a learning and development function within
an organization, interested in building an in-house course or training program.
There are also leaders, managers, and individual contributors who want to upskill 
their teams or provide good coaching and feedback to others. 

This book addresses all of these needs for individuals, teachers, and leaders. By 
sharing invaluable insight through many practical examples, guided practice, and 
open-ended exercises, I will help build your confidence and credibility when it 
comes to applying and teaching others to apply the storytelling with data lessons. 

How this book is organized & what to expect

Each chapter starts with a brief recap of the key lessons that are covered in storytelling
with data. This is followed by:

practice with Cole: exercises based on real-world examples posed for you to consider
and solve, accompanied by detailed step-by-step illustration and explanation

practice on your own: more exercises and thought-provoking questions for you to 
work through individually without prescribed solutions 

practice at work: thoughtful guidance and hands-on exercises for applying the 
lessons learned on the job, including practical instruction on when and how to 
solicit useful feedback and iterate to refine your work from good to great 

Much of the content you'll encounter here is inspired by our storytelling with 
data workshops. Because these sessions span many industries, so do the examples
upon which I'll draw. We'll navigate between different topics, from digital
marketing to pet adoption to sales training, giving you a rich and varied set of
situations to learn from as you hone your data storytelling skills. 

Warning: this is not a traditional book that you sit and read. To get the most out 
of it, you'll want to make it a fully interactive experience. I encourage you to 
highlight, add bookmarks, and take notes in the margins. Expect to be flipping 
between pages and examples. Draw, discuss with others, and practice in your 
tools. This book should be beat up by the time you're done with it: that will be 
one indication that you've utilized it to the fullest extent! 

How to use this book in conjunction with the original 

SWD: let's practice! works as a great companion guide to storytelling with data: 
a data visualization guide for business professionals (Wiley, 2015; henceforth referred
to as SWD). It will not replace the in-depth lessons taught there, but rather
augment them with additional dialogue, many more examples, and a focus on 
hands-on practice. 

This book generally follows the same chapter structure as SWD with a couple of 
differences, as shown in Figure 0.1. Chapters 7, 8, and 9 are comprehensive exercises
that offer additional guidance and practice applying the lessons covered
throughout SWD and here. 

SWD: original book

1 the importance of context
2 choosing an effective visual
3 clutter is your enemy!
4 focus your audience's attention
5 think like a designer
6 dissecting model visuals
7 lessons in storytelling
8 pulling it all together
9 case studies
10 final thoughts

SWD: let's practice!

1 understand the context
2 choose an effective visual
3 identify & eliminate clutter
4 focus attention
5 think like a designer
6 tell a story
7 practice more with Cole
8 practice more on your own
9 practice more at work
10 closing words


FIGURE 0.1 How SWD chapters correspond to this book 

If you've picked up both SWD and SWD: let's practice!, you can use them in a 
couple of ways. You can read SWD once from start to finish to understand the 
big picture before digging into specifics. From there, you can determine which 
lessons you'd like to practice and can dive into the relevant sections within this 
book. Alternatively, you can peruse SWD one chapter at a time, then turn here to 
practice what you've read through hands-on exercises. 

If you've already read SWD, feel free to jump right in as you will be familiar with 
these topics. 

And if you've only bought this book, there is enough context within to give you 
the basics. You can always pick up a copy of SWD or check out the many resources
at storytellingwithdata.com for supplemental guidance.


Do you want to learn or teach? 

SWD: let's practice! was written with two different audiences in mind, united by 
a common goal—to communicate more effectively with data. Broadly, these two 
distinct groups are: 

1. Those wanting to learn how to communicate more effectively with data, and 

2. Those wanting to provide feedback, coach, or teach others how to communicate
more effectively with data.

While the content is relevant for both groups, there will be subtle differences 
when it comes to getting the most out of it. Depending on your goal, the following
strategies will maximize efficiency.

I want to learn to communicate more effectively with data

Because some later content builds upon or refers back to earlier content or exercises,
begin with Chapter 1 and work through in numerical order. After that, you'll
likely find yourself revisiting sections of interest and focusing your practice based 
on your specific needs and goals. 

Start by reviewing the lesson recap for a given chapter. If you encounter anything 
that isn't familiar and you have access to SWD, turn back to the corresponding 
chapter for additional context. 

After that, move straight into the practice with Cole exercises. First, work through 
each on your own—don't just jump to the solution (you're only cheating yourself!). 
If you're using this book with others, many of these activities lend themselves well 
to group discussion. The exercises in this section don't necessarily need to be 
worked through in order, though they do occasionally build upon prior exercises. 

Once you've spent time on the given exercise (not just in your head: I strongly 
encourage you to write, draw, and use your tools), read through the provided solution.
Observe where there are similarities and differences between that and your
response. Be aware that there are very few situations where there is a single "right" 
answer. Some approaches are better than others, but there are usually numerous 
ways to solve a given problem. My solutions illustrate just one method that applies 
the lessons covered in SWD. Do read through all of the solutions, as many points of 
advice, tips, and nuances will arise that you will find helpful and insightful. 

After completing the practice with Cole exercises, turn to the practice on your 
own section for more. These problems are similar to those in the first section, 
except that they don't include any predetermined solutions. If you are working 
in a group, have individuals first tackle a given exercise separately, then come 
together to present and discuss. Invariably, different people approach exercises in 
distinct ways, so you can learn a good deal through this sharing process. Conferring
with others is also great practice for talking through your design choices and
decisions, which can further clarify thinking and help improve future application. 
Whether completing on your own or as part of a group, get feedback on your 
recommended approach. This will help you understand if what you propose is 
working, as well as where you can iterate to further improve effectiveness. 

If, at any time, you find yourself with a current project that would benefit from applying
the lessons outlined in a specific chapter, flip straight to the practice at work
exercise section within that chapter. These contain guided practice that can be applied
directly to real-world work situations. The more you practice implementing the
various lessons in a work setting, the more they will become second nature. 

Each chapter ends with discussion questions related to the lessons. Talk through these 
with a partner or perhaps even use as the basis of a larger book club conversation. 


While the exercise sections in each chapter focus primarily on applying the given
lesson, Chapters 7, 8, and 9 offer more comprehensive examples and exercises 
for applying the entire storytelling with data process. Chapter 7 ("practice more 
with Cole") contains full-blown case studies presented for you first to solve, followed
by my thought process for tackling and completing. Chapter 8 ("practice
more on your own") has additional case studies and robust exercises to practice 
the process without prescribed solutions. Chapter 9 ("practice more at work") has 
tips on how to apply the storytelling with data process at work, guides to facilitate 
group learning, and assessment rubrics that you can use to evaluate your own 
work and seek feedback from others. 

As part of your learning, it's also imperative that you set specific goals. Communicate
these to a friend, colleague, or manager. See Chapter 9 for more on this.

Next, let's talk about how those interested in teaching others to effectively tell 
stories with data can use this book. 

I want to provide feedback, coach, or teach others 

You might be a manager or leader who wants to give good feedback on a graph 
or presentation from your team. Or perhaps you have a role in learning and development
and are building training programs around how to communicate effectively
with data. You may be a university instructor teaching students this important
skill. In all of these scenarios, the chapter recap will provide an overview of the
given lesson. After that, you will likely find the most value in the second and third 
exercise sections: practice on your own and practice at work. Each chapter ends 
with discussion questions that can be assigned, incorporated into tests, or used 
as the basis of group conversations. 

The practice on your own section within each chapter contains targeted exercises 
helping those undertaking them practice the lessons outlined in the respective 
chapter and relevant section of SWD. These can be used as the basis of hands-on 
exercises in a classroom setting or assigned as homework. Some will also lend 
themselves well for use as group projects. These examples are provided without 
prescribed solutions. The problems in these sections can also work as models: 
consider where you could substitute data or visuals to create unique exercises. 

Practice at work's guided exercises can be used directly in a work setting as part 
of an ongoing program for professionals. They can be assigned, completed, and 
discussed in a group or classroom setting. Managers looking to develop their 
team's skills may ask them to focus on specific exercises through their work or 
projects, or use with individuals as part of a goal-setting or career development 
process. For those teaching, Chapter 9 has additional practice at work exercises, 
including facilitator guides and assessment rubrics. 

A quick note on tools

Many tools are available for visualizing data. You may use spreadsheet applications
like Excel or Google Sheets. Perhaps you are familiar with chart creators such
as Datawrapper, Flourish, or Infogram or data visualization software like Tableau
or Power BI. Maybe you write code in R or Python or leverage JavaScript libraries
like D3.js. Regardless of your tool of choice, pick one or a set of tools and get 
to know them as best you can so the instrument itself doesn't become a limiting 
factor for effectively communicating with data. No tool is inherently good or evil— 
pretty much any can be used well or not so well. 

When it comes to undertaking the exercises in this book, you are encouraged to 
use whatever means for visualizing the data you have at your disposal. These may 
be tools you use currently, or possibly one or more that you'd like to learn. The 
visuals that illustrate the practice with Cole solutions were all created in Microsoft 
Excel. That said, this is certainly not your only choice and I welcome you to use 
other tools. We are also adding solutions built in other tools to our online library 
for you to explore. 

On the topic of tools, there are a couple I highly recommend having on hand 
while reading this book: a pen or pencil and paper. You may consider dedicating 
a notebook to use as you work your way through the various exercises. Many 
direct you to write and sketch. There are important benefits to low-tech physical 
creation and iteration that we'll explore and practice, which can make the process 
of working in your technical tools more efficient. 

Where to get the data 

Downloads for the data throughout this book and for all of the visuals shown in 
the solutions for the practice with Cole exercises can be found at
storytellingwithdata.com/letspractice/downloads.

Let's get started 

There has never been a time in history where so many people have had access 
to so much data. Yet, our ability to tell stories with our graphs and visualizations 
has not kept pace. Organizations and individuals that want to move ahead must 
recognize that these skills aren't inherent and invest in their development. With a 
thoughtful approach, we can all tell inspiring and influential stories with our data. 

I'm excited to help you take your data storytelling to the next level. 


Let's practice! 




chapter one 


understand 
the context 

A little planning can go a long way and lead to more concise and effective communications.
In our workshops, I find that we allocate an increasing amount of time
and discussion on the very first lesson we cover, which focuses on context. People 
come in thinking they want data visualization best practices and are surprised by 
the amount of time we spend on—and that they want to spend on—topics related 
more generally to how we plan for our communications. By thinking about our 
audience, message, and components of content up front (and getting feedback 
at this early stage), we put ourselves in a better position for creating graphs, presentations,
or other data-backed materials that will meet our audience's needs and
our own. 

The exercises in this chapter focus primarily on three important aspects of the 
planning process: 

1. Considering our audience: identifying who they are, what they care about, 
and how we can better get to know them and design our communications 
with them in mind. 

2. Crafting and refining our main message: the Big Idea was introduced briefly 
in SWD; here, we'll undertake a number of guided and independent exercises 
to better understand and practice this important concept. 

3. Planning content: storyboarding is another concept that was introduced in
SWD; we'll look at a number of additional examples and exercises related to
what we include and how we organize it. 


Let's practice understanding the context! 

First, we'll review the main lessons from SWD Chapter 1. 

Exercise 1.1: get to know your audience

Who is my audience? What do they care about? These may seem like obvious 
questions to ask ourselves when we step back and think about it, but too often we 
completely skip this step. Getting to know our audience and understanding their 
needs and what drives them is an important early part of the process for successfully
communicating with data.

Let's examine what this looks like in the wild and how we can get to know a new 
audience. 

Imagine you work as a People Analyst (a data analyst within the Human Resources, 
or HR, function) at a medium-sized company. A new head of HR has just joined the 
organization (she is now your boss's boss). You've been asked to pull together an 
overview with data to help the freshly hired head of HR get up to speed with the 
different parts of the business from a people standpoint. This will include things 
like interview and hiring metrics, a headcount review across different parts of the 
organization, and attrition data (how many are leaving and why they are leaving). 
Some of your colleagues in other groups within HR have already had meet-and-greets
with the new leader and given their respective synopses. Your direct manager
recently had lunch with the new head of HR.

How could you get to better know your audience (the new head of HR) in this 
circumstance? List three things you could do to understand your audience, what 
she cares about, and how to best address her needs. Be specific in terms of what 
questions you would seek to answer. Get out your pen and paper and physically 
write down your responses. 



Solution 1.1: get to know your audience

Since this isn't likely a case where we can ask our audience directly what she cares 
about, we'll need to get a little creative. Here are three things I could do to set 
myself up for success when it comes to better understanding my audience and 
what matters to her most: 

1. Set up time to get a debrief from colleagues who have already met with the 
new leader. Talk to those who have had conversations with the new head of 
HR. How did those discussions go? Do they have any insight on this new leader's
priorities or points of interest? Is there anything that didn't go well from
which you can learn and adapt? 

2. Talk to my manager to get insight. My manager has lunched with the new 
leader: what insight did he get about potential first points of focus? I also 
need to understand what my manager sees as important to focus on in this 
initial meeting. 

3. Use my understanding of the data and context plus some thoughtful design 
to structure the document. Given that I've been working in this space for a 
while, I have a big-picture understanding of the different main topics that
someone new to our organization will presumably be interested in and the
data we can use to inform. If I'm strategic in how I structure the document,
I can make it easy to navigate and meet a wide variety of potential needs. I
can provide an overview with the high-level takeaways up front. Then I can
organize the rest of the document by topic so the new leader can quickly turn 
to and get more detail on the areas that most interest her. 

Exercise 1.2: narrow your audience

There is tremendous value in having a specific audience in mind when we communicate.
Yet, often, we find ourselves facing a wide or mixed audience. By trying to
meet the needs of many, we don't meet any specific need as directly or effectively 
as we could if we narrowed our focus and target audience. This doesn't mean that 
we don't still communicate to a mixed audience, but having a specific audience in 
mind first and foremost means we put ourselves in a better position to meet that 
core audience's needs. 

Let's practice the process of narrowing for purposes of communicating. We'll start
by casting a wide net and then employ various strategies to focus from there. 
Work your way through the questions and write out how you would address them. 
Then read the following pages to better understand various strategies for narrowing
our audience.

You work at a national clothing retailer. You've conducted a survey asking your 
customers and the customers of your competitors about various elements related 
to back-to-school shopping. You've analyzed the data. You've found there are 
some areas where your company is performing well, and also some other areas of 
opportunity. You're nearing the point of communicating your findings. 

QUESTION 1: There are a lot of different groups of people (at your company 
and potentially beyond) who could be interested in this data. Who might care 
how your stores performed in the recent back-to-school shopping season? Cast 
as wide a net as possible. How many different audiences can you come up with
who might be interested in the survey data you've analyzed? Make a list! 

QUESTION 2: Let's get more specific. You've analyzed the survey data and found 
that there are differences in service satisfaction reported by your customers across 
the various stores. Which potential audiences would care about this? Again, list 
them. Does this make your list of potential audiences longer or shorter than it 
was originally? Did you add any additional potential audiences in light of this new 
information? 

QUESTION 3: Let's take it a step further. You've found there are differences in
satisfaction across stores. Your analysis reveals items related to sales associates 
as the main driver of dissatisfaction. You've looked into several potential courses 
of action to address this and determined that you'd like to recommend rolling 
out sales associate training as a way to improve and bring consistency to service 
levels across your stores. Now who might your audience be? Who cares about 
this data? List your primary audiences. If you had to narrow to a specific decision 
maker in this instance, who would that be? 

Solution 1.2: narrow your audience

QUESTION 1: There are many different audiences who might care about the 
back-to-school shopping data. Here are some that I've come up with (likely not a 
comprehensive list): 

• Senior leadership 

• Buyers 

• Merchandisers 

• Marketing 

• Store managers 

• Sales associates 

• Customer service people 

• Competitors 

• Customers 

Eventually, everyone in the world may care about this data! Which is great, but 
not so helpful when it comes to narrowing our audience for the purpose of communicating.
There are a number of ways we can narrow our audience: by being
clear on our findings, specific on the recommended action, and focused on the 
given point in time and decision maker. The answers to the remaining questions 
will illustrate how we can focus in these ways to have a specific audience in mind 
when we communicate. 

QUESTION 2: If service levels are inconsistent across stores, the following audiences
are likely to care most:

• Senior leadership 

• Store managers 

• Sales associates 

• Customer service people 

QUESTION 3: We want to roll out training—that sparks some questions for me. 
Who will create and deliver the training? How much will it cost? With this additional
clarity, some new audiences have entered the mix:

• Senior leadership 

• HR 

• Finance 

• Store managers 

• Sales associates 

• Customer service people 

The preceding list may all eventually be audiences for this information. We've
noted inconsistencies with service levels and need to conduct training. HR will 
have to weigh in on whether we can meet this need internally or if it will require 
us to bring in external partners to develop or deliver training. Finance controls the 
budget and we'll have to figure out where to get the money to pay for this. Store 
managers will need to buy in so they are willing to have their employees spend
time attending the training. The sales associates and customer service people will
have to be convinced that their behavior needs to change so that they will take
the training seriously and provide consistent, high-quality service to customers.

But not all of these groups are immediate audiences. Some of the communications
will take place downstream.

To narrow further, I can reflect on where we are at in time: today. Before we can do 
any of the above, we need approval that rolling out training is the right course of 
action. A decision needs to be made, so another way of narrowing my audience 
is to be clear on timing as well as who the decision maker (or set of decision makers)
is within the broader audience. In this instance, I might assume the ultimate
decision maker—the person who will either say, "yes, I'm willing to devote the 
resources; let's do this," or "no, not an issue; let's continue to do things as we 
have been"—is a specific person on the leadership team: the head of retail sales. 

In this example, we have employed a number of different ways to narrow our target
audience for the purpose of the communication. We narrowed by:

1. Being specific about what we learned through the data, 

2. Being clear on the action we are recommending, 

3. Acknowledging what point we're at in time (what needs to happen now), and 

4. Identifying a specific decision maker. 

Consider how you can use these same tactics to narrow your audience in your 
own work. Exercise 1.18 in practice at work will help you do just that. But before 
we get there, let's continue to practice together and turn our attention to a useful 
resource: the Big Idea worksheet. 

Exercise 1.3: complete the Big Idea worksheet

The Big Idea is a concept that can help us get clear and succinct on the main message
we want to get across to our audience. The Big Idea (originally introduced
by Nancy Duarte in Resonate, 2010) should (1) articulate your unique point of 
view, (2) convey what's at stake, and (3) be a complete sentence. Taking the time 
to craft this up front helps us get clarity and concision on the overall idea we need 
to communicate to our audience, making it easier and more streamlined to plan 
content to get this key message across. 

In storytelling with data workshops, we use the Big Idea worksheet to help craft 
our Big Idea. Attendees commonly express how unexpectedly helpful they find 
this simple activity. We'll do a few related exercises so you can practice and see 
examples of the Big Idea worksheet in action. Let's start by continuing with the 
example we just worked through for narrowing our audience. As a reminder, the 
basic context follows. 

You work at a national clothing retailer. You've conducted a survey asking your 
customers and the customers of your competitors about various elements related 
to back-to-school shopping. You've analyzed the data. You've found there are 
some areas where your company is performing well, as well as some areas of 
opportunity. In particular, there are inconsistencies in service levels across stores. 
Together with your team, you've explored some different potential courses of action
for dealing with this and would like to recommend solving through sales
associate training. You need agreement that this is the right course of action and 
approval for the resources (cost, time, people) it will take to develop and deliver 
this training. 

Think back to the audience we narrowed to in Exercise 1.2: the head of retail. 
Work your way through the Big Idea worksheet on the following page for this 
scenario. Make assumptions as needed for the purpose of the exercise. 

the BIG IDEA worksheet

Identify a project you are working on where you need to communicate in a
data-driven way. Reflect upon and fill out the following.

PROJECT

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to whom you'll be communicating.

(2) If you had to narrow that to a single person, who would that be?

(3) What does your audience care about?

(4) What action does your audience need to take?

WHAT IS AT STAKE?

What are the benefits if your audience acts in the way that you want them to?

What are the risks if they do not?

FORM YOUR BIG IDEA

It should:

(1) articulate your point of view,
(2) convey what's at stake, and
(3) be a complete (and single!) sentence.

FIGURE 1.3a The Big Idea worksheet

Solution 1.3: complete the Big Idea worksheet

the BIG IDEA worksheet

Identify a project you are working on where you need to communicate in a
data-driven way. Reflect upon and fill out the following.

PROJECT

Back-to-school opportunity

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to whom you'll be communicating.

the executive team

(2) If you had to narrow that to a single person, who would that be?

the head of retail

(3) What does your audience care about?

- having a highly profitable back-to-school shopping season
- making customers happy because happier customers spend more
- beating the competition

(4) What action does your audience need to take?

Agree that training is the right way to deal with inconsistent service levels
and approve the resources it will take to make that happen (cost, time, people)

WHAT IS AT STAKE?

What are the benefits if your audience acts in the way that you want them to?

- better service levels - happier customers
- happier customers spend more, come back more often, tell friends about their
  positive experience

What are the risks if they do not?

- no action could lead to negative word of mouth
- people shopping with competitors
- reputational risk
- lost revenue

FORM YOUR BIG IDEA

It should:

(1) articulate your point of view,
(2) convey what's at stake, and
(3) be a complete (and single!) sentence.

Let's invest in sales associate training to improve the in-store shopping
experience and make the upcoming back-to-school season the best
revenue generating one yet!

FIGURE 1.3b Completed Big Idea worksheet


Exercise 1.4: refine & reframe 

Consider both your Big Idea from Exercise 1.3 and the one I came up with in
Solution 1.3. Answer the following questions.

QUESTION 1: Compare and contrast. Are there common points where they are 
similar? How are they different? Which do you find to be more effective and why? 

QUESTION 2: How did you frame? Reflect on the Big Idea you originally crafted. 
Did you frame it positively or negatively? What is the benefit or risk in your Big 
Idea? How could you reframe it to be the opposite? 

QUESTION 3: How did I frame? Revisit the Big Idea articulated in Solution 1.3.
Is it framed positively or negatively? What is the benefit or risk in this Big Idea? 
Again, how could you reframe it to be the opposite? How else might you refine? 
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Solution 1.4: refine & reframe 

Given that I don't have your Big Idea as I write this, I'll focus on Question 3, which 
poses some questions about mine. Here it is again for reference: 

Let's invest in sales associate training to improve
the in-store shopping experience and make the upcoming
back-to-school season the best revenue generating one yet!

How did I frame? What is the benefit or risk? This is currently framed positively,
focusing on the benefit of the revenue we stand to gain by investing in sales associate
training.

How could you reframe it to be the opposite? I could reframe negatively a couple
of different ways. One simple way would be to focus on the same thing at
stake (revenue) but change to emphasize the loss that could result from not
taking action.

If we don't invest in sales associate training to improve
service levels, we will lose customers and have lower revenue
for the upcoming back-to-school shopping season.

But revenue isn't the only thing at stake. What if I know that my audience is highly 
motivated by beating the competition? Then I could try something like this: 

We are losing to the competition when it comes to important
aspects of our store experience - we will continue to lose
unless we invest in sales associate training to improve
the customer experience across our stores.

How else can we refine this Big Idea? There's no single right answer. There are a 
number of different potential benefits (more satisfied customers, greater revenue, 
beating the competition) and risks (unhappy customers, lower revenue, losing to 
competition, negative word of mouth, reputational damage). What we assume 
our audience cares most about will influence how we frame and what we focus on 
in our Big Idea. 

In a real-life scenario, we'd want to know as much about our audience as we can to 
make smart assumptions. Check out Exercise 1.17 in practice at work for guidance 
on getting to know your audience. Next, let's look at another Big Idea worksheet. 




Exercise 1.5: complete another Big Idea worksheet 

Let's do another practice run with the Big Idea worksheet. 

Imagine you volunteer for your local pet shelter, a nonprofit organization whose 
mission is to improve the quality of animal life through veterinary care, adoptions, 
and public education. You help organize monthly pet adoption events, which feed 
into the organization's broader goal of increasing permanent adoptions of pets 
by 20% this year. 

Traditionally, these monthly events have been held in outdoor spaces in your community
(parks and greenways) on Saturday mornings. However, last month's event
was different. Due to poor weather, the event was relocated indoors to a local pet 
supply retailer. Surprisingly, after the event, you observed something interesting: 
nearly twice as many pets were adopted compared to previous months. 

You have some initial ideas about the reasons for this increase and think there's 
value in holding more adoption events at this retailer. You'd like to conduct a pilot 
program over the next three months to see if the results help confirm your beliefs. 
To implement this pilot program, you'll need additional support from the pet shelter's
marketing volunteers to publicize the events. You've estimated the monthly
costs to be $500 for printing and three hours of a marketing volunteer's time. You 
want to ask the event committee to approve the pilot program at next month's 
meeting and are planning your communication. 
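
To size the request before filling in the worksheet, it can help to total the monthly estimates over the full pilot. The short Python sketch below does that arithmetic. The constant and function names are hypothetical, and it assumes the $500 printing cost and the three volunteer hours recur in each of the three pilot months, as the scenario's "monthly costs" wording suggests.

```python
# Illustrative sketch only: names are hypothetical, figures come from the scenario.
MONTHLY_PRINTING_COST = 500   # dollars per month for printing
MONTHLY_VOLUNTEER_HOURS = 3   # marketing volunteer hours per month
PILOT_MONTHS = 3              # length of the proposed pilot


def pilot_totals(months=PILOT_MONTHS):
    """Return (total printing dollars, total volunteer hours) for the pilot.

    Assumes both monthly estimates repeat unchanged every month.
    """
    return MONTHLY_PRINTING_COST * months, MONTHLY_VOLUNTEER_HOURS * months


total_cost, total_hours = pilot_totals()
print(
    f"Pilot ask: ${total_cost:,} for printing and "
    f"{total_hours} volunteer hours over {PILOT_MONTHS} months"
)
```

Under those assumptions, the full ask to the committee comes to $1,500 in printing and 9 volunteer hours.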

Complete the Big Idea worksheet on the following page for this scenario, making
assumptions as necessary for the purpose of the exercise.

the BIG IDEA worksheet

PROJECT:

Identify a project you are working on where you
need to communicate in a data-driven way.
Reflect upon and fill out the following.

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to
whom you'll be communicating.

(2) If you had to narrow that to a single person,
who would that be?

(3) What does your audience care about?

(4) What action does your audience need to take?


WHAT IS AT STAKE?

What are the benefits if your audience acts
in the way that you want them to?

What are the risks if they do not?


FORM YOUR BIG IDEA

It should:

(1) articulate your point of view,
(2) convey what's at stake, and
(3) be a complete (and single!) sentence.


FIGURE 1.5a The Big Idea worksheet

Solution 1.5: complete another Big Idea worksheet

The following illustrates one way to complete the Big Idea worksheet for this scenario.


the BIG IDEA worksheet

PROJECT: Adoption venue pilot

Identify a project you are working on where you
need to communicate in a data-driven way.
Reflect upon and fill out the following.

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to
whom you'll be communicating.

Shelter events planning committee.
They'll decide based on a majority vote.

(2) If you had to narrow that to a single person,
who would that be?

Jane Harper, the most influential
person on the committee, whose opinion
would likely affect the outcome

(3) What does your audience care about?

Increasing pet adoptions in general
and specifically toward the organization's
20% increase goal, which will improve ability
to fundraise; they are cost-conscious,
so low-cost options are often supported

(4) What action does your audience need to take?

Approve my pilot program of holding pet
adoptions at a local pet supply retailer for
the next 3 months and provide additional
marketing resources: $500 to print posters
+ 3 hours/month of a marketing
volunteer's time


WHAT IS AT STAKE?

What are the benefits if your audience acts
in the way that you want them to?

More adoptions (lower euthanization),
which will help us achieve the
broader 20% goal and help
with future fundraising


What are the risks if they do not?

- Missed opportunity to increase adoptions

- More animals don't find homes

- Greater euthanization + associated cost

- Miss 20% goal


FORM YOUR BIG IDEA

It should:

(1) articulate your point of view,
(2) convey what's at stake, and
(3) be a complete (and single!) sentence.


Approve our low-cost pilot program
that has potential to markedly increase
adoptions and result in better future
fundraising opportunities.


FIGURE 1.5b Completed Big Idea worksheet

Exercise 1.6: critique the Big Idea

Being able to give good feedback on the Big Idea is important both when we
work with others and when we critique and refine our own work. Let's practice
giving feedback on the Big Idea.

Suppose you work for a health care center that has been analyzing recent vaccine 
rates. Your colleague has been focusing on progress and opportunities related to 
flu vaccines. He has crafted the following Big Idea for the update he is preparing 
and has asked for your feedback. 

While flu vaccination rates have improved since last year, we need to increase the 
rate in our area by 2% to hit the national average. 

With this Big Idea in mind, write a few sentences outlining your response to the 
following. 

QUESTION 1: What questions might you ask your colleague? 

QUESTION 2: What feedback would you provide on his Big Idea?

Solution 1.6: critique the Big Idea

QUESTION 1: The immediate questions I'd have for my colleague would be 
about their audience: who are they? What do they care about? 

QUESTION 2: In terms of giving specific feedback on the Big Idea, let's think 
back to the components of the Big Idea—it should (1) articulate your point of 
view, (2) convey what's at stake, and (3) be a complete (and single!) sentence. Let's 
consider each of these in light of my colleague's Big Idea. 

1. Articulate your point of view. The point of view is that vaccination rates are 
low compared to the national average and need to be increased. 

2. Convey what's at stake. This isn't clear to me currently. I'm going to want to ask 
some targeted questions to better understand what is at stake for the audience. 

3. Be a complete (and single!) sentence. Good job on this front. It's often difficult
to summarize our point in a single sentence. If anything, we have room to
possibly add a little more to the sentence to make it meatier and more clearly
convey what is at stake.

In general, the Big Idea in its current form gives me the what (increase vaccination 
rates), but not the why (it also doesn't get into the how, though there's only so 
much we can fit into a sentence, and this piece can come into play through the 
supporting content). 

You could argue that the why is because we're lower than the national average, 
but this doesn't feel compelling enough. Is my audience going to be motivated 
by a national average comparison? Is that even the right goal? Is it aggressive 
enough? Too aggressive? Can we get more specific by thinking through what will 
be most motivating for the audience? 

It's clear my colleague believes that we should increase flu vaccination rates. But 
let's consider why our audience should care. What does this mean for them? Are
they motivated by competition? Maybe we're lower than that other medical center
across town, or our area is low compared to the state, or perhaps the national
comparison is the right one but can be articulated in a more motivating way. Or
maybe my audience is driven by generally doing good, in which case we could get into
patient advantages or highlight general community well-being benefits that would
be well served by increasing vaccination rates. And when we think about positive versus
negative framing, which will be best for this scenario and audience?

The conversation I have with my colleague will cause him to explain his thought
process, what he knows about his audience, and what assumptions he's making.
The dialogue we have will help him both refine his Big Idea and be better
prepared to talk through this with his ultimate audience. Success!

Exercise 1.7: storyboard!

I sometimes feel like a broken record because I say this so frequently: storyboarding
is the most important thing you can do up front as part of the planning process
to reduce iterations down the road and create better targeted materials. A storyboard
is a visual outline of your content, created in a low-tech manner (before
you create any actual content). My preferred tool for storyboarding is a stack of
sticky notes, which are both small (forcing us to be concise in our ideas) and
easy to rearrange to explore different narrative flows. I
typically storyboard in three distinct steps: brainstorming, editing, and seeking
and incorporating feedback.

We'll do a couple of practice storyboarding runs so you can both get a feel for it 
and see illustrative approaches. Let's start with an example you should be familiar 
with now (we've seen it previously in Exercises 1.2, 1.3, and 1.4). As a reminder, 
the basic context follows. 

You work at a national clothing retailer. You've conducted a survey asking your 
customers and the customers of your competitors about various elements related 
to back-to-school shopping. You've analyzed the data. You've found there are 
some areas where your company is performing well, as well as some areas of opportunity.
In particular, there are inconsistencies in service across stores. Together
with your team, you've explored different potential courses of action for dealing 
with this and would like to recommend solving through sales associate training. 
You need agreement that this is the right course of action and approval for the 
resources (cost, time, people) it will take to develop and deliver this training. 

Look back to the Big Idea that you created in Exercise 1.3 (or if you didn't create 
one, select one of the Big Ideas from Solutions 1.3 or 1.4). Complete the following 
steps with a specific Big Idea in mind. 

STEP 1: Brainstorm! What pieces of content may you want to include in your 
communication? Get a blank piece of paper or a stack of stickies and start writing 
down ideas. Aim for a list of at least 20. 

STEP 2: Edit. Take a step back. You've come up with a ton of ideas. How could 
you arrange these so that they make sense to someone else? Where can you combine?
What ideas did you write down that aren't essential and can be discarded?
When and how will you use data? At what point will you introduce your Big Idea?
Create your storyboard or the outline for your communication. (I highly recommend
using sticky notes for this part of the process!)

STEP 3: Get feedback. Grab a partner and have them complete this exercise,
then get together and talk about it. How are your storyboards similar? Where 
do they differ? If you don't have a partner who has completed the exercise, you 
can still talk someone through your plan. What changes would you make to your 
storyboard after talking through it with someone else? Did you learn anything 
interesting through this process?

Solution 1.7: storyboard!

Looking back to Exercise 1.3, my Big Idea was the following:

Let's invest in sales associate training to improve
the in-store shopping experience and make the upcoming
back-to-school season the best revenue generating one yet!

I'll keep this in mind as I work through the storyboarding steps. 

STEP 1: Below is my initial list of potential topics/pieces of content to include 
from my brainstorming process. 

1. Historical context (back-to-school shopping is important)
2. Problem we're trying to solve (historically not data driven)
3. Different ways we envisaged solving the problem
4. Course of action we undertook: survey
5. Survey: customer groups we asked, general demographics, response rates
6. Survey: details on competitors we included
7. Survey: questions we asked, open and close date of survey
8. Data: how our store compares across the various items
9. Data: how this breaks down across stores and regions
10. Data: how we compare to the competition
11. Data: how competitor comparison breaks down by stores & regions
12. Good news: where we're doing best or beating competition
(with store breakdown)
13. Bad news: where we're doing worse or lower than competition
(with store breakdown)
14. Areas for improvement
15. Potential remedies
16. Recommended course of action: invest in sales training
17. Resources needed (people, budget)
18. What this will solve
19. Projected timeline
20. Discussion to have/decision to be made

STEP 2: Figure 1.7 illustrates how I might curate the preceding list into a storyboard.



[Storyboard of sticky notes, in order:]

- Back-to-school shopping is important (DEMONSTRATE)
- Historically NOT data-driven
- Ways we considered becoming more data-driven
- Action/[illegible]
- Survey: all the details (who we asked, who responded, competitors, etc.)
- What we learned from the data (analysis results)
- OPPORTUNITY: inconsistencies in service levels
- RECOMMENDATION: invest in employee training (+ details)
- DISCUSSION/APPROVAL


FIGURE 1.7 Back-to-school shopping: a potential storyboard 

Does Figure 1.7 illustrate the "right" answer? No. Will you always end up with a 
perfect grid of sticky notes like this? Not likely. Are there things you would have 
done differently? Probably. Are there additional changes that I would make to 
this? Yes. We'll revisit this scenario again a little later to explore how we can further 
refine this storyboard. But for now, take this as one illustrative storyboard and let's 
turn our attention to Step 3. 

STEP 3: What feedback do you have for me on this storyboard? How is yours 
similar? Where does it differ? Consider how you can apply this approach to a current
project you face. Exercises 1.23, 1.24, and 1.25 in practice at work will help
you do just that. Before we get there, let's do some additional guided practice
storyboarding.

Exercise 1.8: storyboard (again!)

For this exercise, we'll create a storyboard using the pet adoptions pilot program 
we introduced in Exercise 1.5. As a reminder, the background is as follows: 

Imagine you volunteer for your local pet shelter, a nonprofit organization whose 
mission is to improve the quality of animal life through veterinary care, adoptions, 
and public education. You help organize monthly pet adoption events, which feed 
into the organization's broader goal of increasing permanent adoptions of pets 
by 20% this year. 

Traditionally, these monthly events have been held in outdoor spaces in your community
(parks and greenways) on Saturday mornings. However, last month's event
was different. Due to poor weather, the event was relocated indoors to a local pet 
supply retailer. Surprisingly, after the event, you observed something interesting: 
nearly twice as many pets were adopted this month compared to previous months. 

You have some initial ideas about the reasons for this increase and think there's 
value in holding more adoption events at this retailer. You'd like to conduct a pilot
program over the next three months to see if these results help confirm your
beliefs. To implement this pilot program, you'll need additional support from the 
pet shelter's marketing volunteers to publicize the events. You've estimated the 
monthly costs to be $500 for printing and three hours of a marketing volunteer's 
time. You want to ask the event committee to approve the pilot program at next 
month's meeting and are planning your communication. 

Look back to the Big Idea that you created in Exercise 1.5 (or if you didn't create 
one, revisit the Big Idea from Solution 1.5). Complete the following steps with this 
specific Big Idea in mind. 

STEP 1: Brainstorm! In this first step, brainstorm what details might be necessary
to include in the eventual presentation. Get a blank piece of paper or a stack of
stickies and start writing down ideas. Aim for a list of at least 20. To aid in your
brainstorming process, ask yourself: has the organization ever tried a pilot program
before? Will the events committee need to understand the risks and benefits
of this program? Are they likely to respond favorably or unfavorably? Do you
have historical data on the number of adoptions from community spaces? Are you
aware whether other shelters have successfully tried this? How will you measure
and assess the results from the three-month pilot? What does success look like?

STEP 2: Edit. Examine all the ideas you generated in Step 1. Next, let's plan how
to put them to use. Determine which pieces of potential content are essential and
which can be discarded. Create your storyboard or the outline for the presentation.
To aid in the editing and arranging process, ask yourself: having identified
your audience's likely response in Step 1, will you start with the Big Idea or build
up to it? How familiar is your audience with the recent success: will you need
to communicate this context or is it already well known? Which other details are
new to the audience and may require more time or data behind them? Will your
audience be accepting of your proposal or will you need to convince them? How
can you best do that?

STEP 3: Get feedback. Grab a partner and have them complete this exercise, 
then get together and talk about it. How are your storyboards similar? Where 
do they differ? If you don't have a partner who has completed the exercise, you 
can still talk someone through your plan. What changes would you make to your 
storyboard after talking through it with someone else? Did you learn anything 
interesting through this process?

Solution 1.8: storyboard (again!)

Looking back to Exercise 1.5, my Big Idea was the following:

Approve our low-cost pilot program, which has
potential to markedly increase adoptions and
result in better future fundraising opportunities.

I'll keep this in mind as I work through the storyboarding steps. 

STEP 1: Following is my initial list of potential topics to include from my
brainstorming process.

1. Historical context: we've always held adoptions at a community space
2. Current state: review benefits and how many were adopted per month
3. Outline how current number of pet adoptions feeds into broader goal
of 20% increase
4. Background on why last month's event was held indoors
5. Results: we saw a 2x increase in adoptions
6. Drivers: possible reasons why this happened
7. Drivers: possible reasons why this may continue if we try again
8. Opportunity: introduce 3-month pilot program
9. Analysis: benefits & risk of pilot program
10. Resources needed: explain additional marketing cost of $500
11. Resources needed: consider additional marketing time of 3 volunteer hours
12. Additional requirements: approval from pet supply store manager,
comms to employees
13. Additional requirements: logistics for planning & set up in store
14. Data: what other pet shelters have done
15. Recommendation: approve this pilot program
16. Discussion: ways we're working to meet 20% increase goal
17. Timeline & proposed dates
18. How we'll track & measure success for 3 months
19. Implications for fundraising
20. Discussion & decision to be made

STEP 2: Figure 1.8 illustrates how I could curate the preceding list into a storyboard.


[Storyboard of sticky notes, grouped under three headings:]

BACKGROUND
- GOAL: increase adoptions by 20%
- CONTEXT: current state of adoptions
- Show how adoptions help fundraising

OPPORTUNITY
- Last month's event was unexpectedly successful
- Introduce possible root causes
- What can we gain from replicating
- Show data demonstrating success for other shelters

PROPOSAL
- Introduce 3-month pilot program
- Discuss investment: $500 cost, 3 hours time
- How we'll measure results
- Recommend: approve program & resources


FIGURE 1.8 Pet adoption pilot program: a potential storyboard 

STEP 3: What feedback do you have for me on this storyboard? How is yours similar?
Where does it differ? How can you apply this approach to a current project
you face? Refer to Exercises 1.23, 1.24, and 1.25 for guidance on storyboarding
at work.


You've practiced narrowing your audiences, crafting your Big Idea, and storyboarding
with me. Next, you'll find more low-risk practice for you to tackle on
your own.

PRACTICE on your OWN

It is by continuing to practice that truly
understanding your audience and integrating
important low-tech planning will feel
constructive and become part of your regular
routine. Let's undertake additional exercises
to help form these good habits.


Exercise 1.9: get to know your audience 

Let's say you work at a consulting company. You have a new client, the director of 
marketing at a prominent pet food manufacturer. You are one level removed from 
your audience: rather than interfacing with them directly, you provide analysis and 
reports to your boss, who presents this work to the client, discusses it with them, 
and then communicates any feedback or additional needs to you. 

How can you better get to know your audience in this case? List three things you 
could do to better understand your audience and what they care about. How 
does having the intermediate audience of your manager potentially complicate 
things? How might you use this to your advantage? What other considerations do 
you need to make in order to be successful in this scenario? 

Write a paragraph or two to answer these questions.

Exercise 1.10: narrow your audience

Next, you'll practice narrowing the audience. Read the following, then work your 
way through the various questions posed to determine how you can narrow your 
audience for purposes of communicating given different assumptions. 

Imagine you work for a regional medical group. You and several colleagues have 
just wrapped up an evaluation of Suppliers A, B, C, and D for the XYZ Products 
category. Your analysis examined historical costs by facility, patient and physician 
satisfaction, and cost projections going forward. You are in the process of creating 
a presentation deck with this information. 

QUESTION 1: There are a lot of different groups of people (at your company 
and potentially beyond) who may be interested in this data. Who can you think of 
who is apt to care how the various suppliers compare when it comes to historical 
usage, patient and physician satisfaction, and cost projections? Cast as wide a
net as possible. How many different audiences can you come up with who might
be interested in this information? List them! 

QUESTION 2: Let's get more specific. The data shows that historical usage has 
varied a lot by medical facility, with some using primarily Supplier B and others
using primarily Supplier D (and only limited historical use of Suppliers A and
C). You've also found that satisfaction is highest across the board for Supplier B. 
Which potential audiences might care about this? Again, list them. Does this 
make your list of potential audiences longer or shorter than it was originally? Did 
you add any additional potential audiences in light of this new information? 

QUESTION 3: Time to take it a step further. You've analyzed all of the data and 
realized there are significant cost savings in going with a single or dual supplier 
contract. However, either of these will mean changes for some medical centers 
relative to their historical supplier usage. You need a decision on how to best 
move forward strategically in this space. Now who might your audience be? Who 
cares about this data? List your primary audiences. If you had to narrow to a specific
decision maker, who would that be?




Exercise 1.11: let's reframe 

One component of the Big Idea is what is at stake for your audience. As we've 
discussed, this can be framed either in terms of benefits (what does your audience 
stand to gain if they act in the way you recommend?) or in terms of risks (what 
does your audience stand to lose if they don't act accordingly?). It is often useful 
to explore both the positive and negative framing as you think through which 
might work best for your specific situation. 

Consider the following Big Ideas and answer the accompanying questions to 
practice identifying and reworking how each is framed. 

BIG IDEA 1: We should increase incentives to complete our email survey so we 
can collect better quality data and gain a robust understanding of our customers' 
pain points. 

(A) Is this Big Idea currently positively or negatively framed? 

(B) What is the benefit or risk in this Big Idea? 

(C) How could you reframe it to be the opposite? 

BIG IDEA 2: We stand to miss our earnings per share target if we don't reallocate 
resources to support emerging markets now that revenue from our traditional line 
of business has plateaued. 

(A) Is this Big Idea currently positively or negatively framed? 

(B) What is the benefit or risk in this Big Idea? 

(C) How could you reframe it to be the opposite? 

BIG IDEA 3: Last quarter's digital marketing campaign resulted in the traffic and 
sales increases we expected: we should maintain current spend levels to achieve 
this year's sales goal. 

(A) Is this Big Idea currently positively or negatively framed? 

(B) What is the benefit or risk in this Big Idea? 

(C) How could you reframe it to be the opposite? 




Exercise 1.12: what's the Big Idea? 

We've undertaken a number of exercises to get you comfortable working your 
way through the Big Idea worksheet and also seeing potential solutions (Exercises 
1.3 and 1.5). These next couple of exercises are similar—we pose a scenario and 
ask you to complete the Big Idea worksheet—but there's no illustrative answer. 
Rather, it's up to you to critique and refine what you've created. 

You are the Chief Financial Officer (CFO) for a national retailer. You are responsible 
for managing the financial well-being of the company and your duties include 
analyzing and reporting on the company's financial strengths and weaknesses 
and proposing corrective actions. Your team of financial analysts just completed
a review of Q1 and has identified that the company is likely to end the fiscal
year with a loss of $45 million if operating expenses and sales follow the latest
projections.

Because of a recent economic downturn, an increase in sales is unlikely. Therefore,
you believe the projected loss can only be mitigated by controlling operating expenses
and that management should implement an expense control policy ("expense
control initiative ABC") immediately. You will be reporting the Q1
results at an upcoming Board of Directors meeting and are planning your communication
(a summary of financial results in a PowerPoint deck) that you will
present to the board with your recommendation.

Your goals for the presentation are twofold: 

1. For the Board of Directors to understand the long-term implications of ending
the year at a net loss, and

2. Get agreement from the daily operating managers (CEO and executives) to 
implement "expense control initiative ABC" immediately. 

Complete the Big Idea worksheet on the following page for this scenario, making
assumptions as necessary for the purpose of the exercise.

the BIG IDEA worksheet

PROJECT:

Identify a project you are working on where you
need to communicate in a data-driven way.
Reflect upon and fill out the following.

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to
whom you'll be communicating.

(2) If you had to narrow that to a single person,
who would that be?

(3) What does your audience care about?

(4) What action does your audience need to take?


WHAT IS AT STAKE?

What are the benefits if your audience acts
in the way that you want them to?

What are the risks if they do not?


FORM YOUR BIG IDEA

It should:

(1) articulate your point of view,
(2) convey what's at stake, and
(3) be a complete (and single!) sentence.


FIGURE 1.12 The Big Idea worksheet

Exercise 1.13: what's the Big Idea (this time)?

Let's do another practice run with the Big Idea worksheet. 

Imagine you're a rising university senior serving on the student government council.
One of the council's goals is to create a positive campus experience by representing
the student body to faculty and administrators and electing representatives
from each undergraduate class. You've served on the council for the past
three years and are involved in the planning for this year's upcoming elections.
Last year, student voter turnout for the elections was 30% lower than previous
years, indicating lower engagement between the student body and the council.
You and a fellow council member completed benchmarking research at other universities
and found that universities with the highest voter turnout had the most
effective student government council at effecting change. You think there's opportunity
to increase voter turnout at this year's election by building awareness of
the student government council's mission by launching an advertising campaign
to the student body. You have an upcoming meeting with the student body president
and finance committee where you will be presenting your recommendation.

Your ultimate goal is a budget of $1,000 for the advertising campaign to increase 
awareness of why the student body should vote in these elections. 

STEP 1: Considering this situation, complete the following Big Idea worksheet, 
making assumptions as needed for the purpose of this exercise. (Don't overlook 
Steps 2 and 3 that follow it.)

the BIG IDEA worksheet

Identify a project you are working on where you need to communicate in a
data-driven way. Reflect upon and fill out the following.

PROJECT:

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to whom you'll be communicating.

(2) If you had to narrow that to a single person, who would that be?

(3) What does your audience care about?

(4) What action does your audience need to take?

WHAT IS AT STAKE?

What are the benefits if your audience acts in the way that you want them to?

What are the risks if they do not?

FORM YOUR BIG IDEA

It should:

articulate your point of view,
convey what's at stake, and
be a complete (and single!) sentence.


FIGURE 1.13 The Big Idea worksheet

STEP 2: Let's suppose you've just learned that your intended audience, the
student body president, will not be attending the upcoming meeting due to
a scheduling conflict. The vice-president will cover the meeting and approve or
deny your budget request. In light of this, answer the following:

(A) You don't know the vice-president well. What could you do to get to know
her better? Identify one thing you can do immediately (before the meeting) to
better understand what she cares about and one thing you'll do over your tenure
on the council to better understand the vice-president's needs for future
communications.

(B) Revisit your framing of the Big Idea. Did you write it with a positive or 
negative focus? What may cause you to change to the opposite framing given 
this new audience? 

STEP 3: You'd like to solicit feedback on your Big Idea. You are deciding between 
two different people from whom to potentially get feedback: (1) your roommate 
or (2) a fellow council member. Answer the following: 

(A) What would be the advantages or disadvantages of each? 

(B) How do you anticipate the conversations would be different? 

(C) Who would you ultimately choose to solicit feedback from? Why?


Exercise 1.14: how could we arrange this?

There are many ways we can organize the content that we present. Storyboarding 
allows us to plan the order and consider different arrangements in light of who our 
audience is and what we hope to achieve when we communicate with them. Look 
at the potential components of a storyboard in Figure 1.14 (which are presented 
in no particular order) and answer the following questions. 


[Components shown, in no particular order: BACKGROUND, ANALYSIS, RECOMMENDATION,
DATA, PROBLEM STATEMENT, FINDING]


FIGURE 1.14 Potential components of a storyboard 

QUESTION 1: How would you arrange these components into a storyboard? 
(What would you start with? What would you end with? How would you order the 
topics in between?) What drives your decisions on how to order the content? 

QUESTION 2: Let's say you made some assumptions about the data as part of 
your analysis. At what point in your planned arrangement would you include this? 
Why is that? 

QUESTION 3: Assume you are presenting to a highly technical audience and 
anticipate there will be a lot of questions and discussion about the data and your 
analysis. Does this change how you would order your content? Are there additional
elements you would include or remove?

QUESTION 4: Imagine you have a solid understanding of the data, but that 
there is important context that your audience will need to contribute in order for 
everyone to understand the full picture. Does this affect how you would arrange 
your content? Where and how would you invite audience input? Would you add 
or remove any elements?

QUESTION 5: Assume you are presenting to senior leadership. You recognize
that you'll only get a short amount of time (perhaps even shorter than your
allotted time slot on the agenda). Does this change how you would order the
content? Why or why not?


Exercise 1.15: storyboard! 

For this exercise, we'll create a storyboard using the CFO's Q1 financial update 
introduced in Exercise 1.12. As a reminder, the background is as follows: 

You are the Chief Financial Officer for a national retailer. You are responsible
for managing the financial well-being of the company, and your duties include
analyzing and reporting on the company's financial strengths and weaknesses and
proposing corrective actions. Your team of financial analysts just completed a
review of Q1 and has identified that the company is likely to end the fiscal
year with a loss of $45 million if operating expenses and sales follow the
latest projections.

Because of a recent economic downturn, an increase in sales is unlikely.
Therefore, you believe the projected loss can only be mitigated by controlling
operating expenses and that management should implement an expense control
policy ("expense control initiative ABC") immediately. You will be reporting the
Q1 results at an upcoming Board of Directors meeting and are planning your
communication (a summary of financial results in a slide deck) that you will
present to the board with your recommendation.

Your goals for the presentation are twofold: 

1. For the Board of Directors to understand the long-term implications of ending
the year at a net loss, and

2. Get agreement from the daily operating managers (CEO and executives) to 
implement "expense control initiative ABC" immediately. 

Look back to the Big Idea that you created in Exercise 1.12 (if you didn't create 
one, spend a few moments doing so now!). Complete the following steps with 
this Big Idea in mind. 

STEP 1: Brainstorm! In this first step, brainstorm what details you might possibly 
include in the eventual presentation. Get a blank piece of paper or a stack of 
sticky notes and start writing down ideas. Aim for a list of at least 20. To aid in the 
brainstorming process, ask yourself a few questions: Is this the first time you've 
introduced your Big Idea to this audience? Do you anticipate they will respond 
favorably or unfavorably? How frequently have they seen the data you'll show
them: is it a regular update, or will you need to allow time to educate them on
unfamiliar terms or methodology? Do you anticipate needing to get buy-in from
the decision maker on your recommendations? If so, what data points need to be
included to help with this process?

STEP 2: Edit. Examine all the ideas you generated in Step 1. Identify which are 
essential and which can be discarded. Create your storyboard or the outline for 
the presentation. To aid in the editing and arranging process, ask yourself: having 
identified your audience's likely response in Step 1, will you start with the Big Idea 
or will you lead up to it at the end? Which details has the audience seen regularly 
that can possibly be discarded? What details are new to the audience and may 
require more time or data behind them? Are there pieces that can be combined? 

STEP 3: Get feedback. Grab a partner and have them complete this exercise,
then get together and talk about it. How are your storyboards similar? Where 
do they differ? If you don't have a partner who has completed the exercise, you 
can still talk someone through your plan. What changes would you make to your 
storyboard after talking through it with someone else? Did you learn anything 
interesting through this process? 


Exercise 1.16: storyboard (again!) 

For this exercise, we'll critique and revise a storyboard using the university
elections example from Exercise 1.13. As a reminder, the background is as follows:

Imagine you're a rising university senior serving on the student government
council. One of the council's goals is to create a positive campus experience by
representing the student body to faculty and administrators and electing
representatives from each undergraduate class. You've served on the council for
the past three years and are involved in the planning for this year's upcoming
elections. Last year, student voter turnout for the elections was 30% lower than
in previous years, indicating lower engagement between the student body and the
council. You and a fellow council member completed benchmarking research at
other universities and found that universities with the highest voter turnout
had the student government councils that were most effective at effecting
change. You think there's an opportunity to increase voter turnout at this
year's election by launching an advertising campaign to the student body that
builds awareness of the student government council's mission. You have an
upcoming meeting with the student body president and finance committee where you
will be presenting your recommendation.

Your ultimate goal is a budget of $1,000 for the advertising campaign to increase
awareness of why the student body should vote in these elections.

QUESTION 1: Your fellow council member created the following storyboard 
(Figure 1.16) for the communication to the student body president and has asked 
for your feedback. Critique the storyboard with these questions in mind: 

(A) How is it currently ordered (chronological, leading with Big Idea,
something else)?

(B) What points would you combine? What would you add? What would you 
remove? 

(C) How would you suggest revising the storyboard based on your critique?

FIGURE 1.16 University elections colleague's storyboard

QUESTION 2: You've now learned that the vice-president will be presiding over
the meeting and will be deciding whether to approve your $1,000 advertising 
campaign. She is a busy woman and you know from others who have presented to 
her before that she will be hyper-focused when you are presenting but frequently 
has an overbooked schedule, often causing her to cut meetings short. In light of 
this primary audience change, reexamine your revised storyboard from Question 
1C. What factors would cause you to make changes to the flow? Would you add 
or remove components? 

QUESTION 3: Revisit the revised storyboard you created in 1C and answer the 
following questions: 

(A) Why did you decide to put the call to action in its current location? 

(B) In creating this storyboard, what were the advantages of using sticky notes 
over software? 

(C) What benefits did you get from creating this storyboard? 

You've practiced with me and on your own. Next, let's talk through how you can 
apply the strategies we've covered in your work.


Exercise 1.17: get to know your audience

When communicating, it can be useful to start by identifying your primary
audience and reflecting on what is important from their perspective. Even if you
don't know your audience, there are ways to get clarity on what drives them. Can
you talk to them and ask questions to better understand their needs? Do you know
people who are similar to your audience? Do you have colleagues who have
successfully (or unsuccessfully) communicated to your audience and might have a
perspective to offer? What assumptions can you make about what your audience
cares about or what motivates them, the biases they may have, whether they will
find data important (and if so, which data), or how they may react to what you
need to convey? As we've discussed, being clear on this can put you in a better
position to successfully communicate.

If you have a mixed audience where different segments or people care about 
different things, it can be useful to create groups of similar audiences and work 
through this exercise for each of them. In cases where you identify overlap in their 
needs, this can be a useful place from which to communicate. 

If you are making assumptions about your audience—and we nearly always are!— 
talk through these with a colleague or two. Do they agree with you? Have them 
help you identify and pressure-test your assumptions. Ask them to play devil's 
advocate and take an opposing viewpoint, so you can practice responding to this. 
The more you can do to anticipate how things could go wrong and prepare for 
that, the better off you'll be. 

Pick a project where you need to communicate something to somebody. Identify
specific actions you can take to better get to know your audience and understand
what is important to them. What assumptions are you making about your audience
when you do this? How big of a deal will it be if those assumptions are wrong?
How else can you prepare for the audience to whom you'll be communicating? List
specific actions, then undertake them!


PRACTICE at WORK 



Exercise 1.18: narrow your audience 

As we've discussed, it can be useful to have a specific audience in mind when we 
communicate. This allows us to really target our communication. The following 
exercise will help you think through how you can narrow your audience. 

STEP 1: Consider a project where you need to communicate in a data-driven way. 
What is the project? 

STEP 2: Start by casting a wide net: list all of the potential audiences who may 
care about what you will be sharing. Write them down! How many can you come 
up with? 

STEP 3: Do you have them all? I bet there are more. See if you can add to the 
list you just made. 

STEP 4: Next, let's narrow. Read through the following questions and list the
audience(s) that will care most in light of each of these things.

(A) What did you learn through the data? Which audience(s) will care about this? 

(B) What is the action you are recommending? Who needs to take this action? 

(C) What point are we at in time—what needs to happen now? 

(D) Who is the ultimate decision maker or group of decision makers? 

(E) In light of all of the above, who is the primary audience(s) to whom you
need to communicate?


Exercise 1.19: identify the action

When we communicate for explanatory purposes, we should always want our
audience to do something: to take an action. Rarely is it as simple as, "We found x;
therefore you should do y." Rather, there are often nuances that come into play in 
determining how explicit we should be with the next step we want our audience 
to take. In some situations, we may need input from them to help determine an 
appropriate course of action. In other cases, we want them to come up with the 
next step on their own. In any event, we—as the communicators—should be very 
clear on what we think that action should be. 

Consider a current project where you need to communicate something to an
audience. List out the potential actions they could take based on the data you
share. What is the primary action you want them to take? Be specific: suppose
you will say the following sentence to your audience:

"After reading my deck or listening to my presentation, you should ________."


If you're having trouble, scan the following list and contemplate whether any of 
these might apply or spark ideas: 

accept agree approve begin believe budget buy champion change 
collaborate commence consider continue contribute create debate 
decide defend desire determine devote differentiate discuss distribute 
divest do empathize empower encourage engage establish examine 
facilitate familiarize form free implement include increase influence invest 
invigorate keep know learn like maintain mobilize move partner pay for 
persuade plan procure promote pursue reallocate receive recommend 
reconsider reduce reflect remember report respond reuse reverse review 
secure share shift support simplify start try understand validate verify


Exercise 1.20: complete the Big Idea worksheet

Identify a project you are working on where you need to communicate in a
data-driven way. Reflect upon and fill out the following. A fresh copy of the
Big Idea worksheet can be downloaded at
storytellingwithdata.com/letspractice/bigidea.


the BIG IDEA worksheet 

Identify a project you are working on where you need to communicate in a
data-driven way. Reflect upon and fill out the following.

PROJECT:

WHO IS YOUR AUDIENCE?

(1) List the primary groups or individuals to whom you'll be communicating.

(2) If you had to narrow that to a single person, who would that be?

(3) What does your audience care about?

(4) What action does your audience need to take?

WHAT IS AT STAKE?

What are the benefits if your audience acts in the way that you want them to?

What are the risks if they do not?

FORM YOUR BIG IDEA

It should:

articulate your point of view,
convey what's at stake, and
be a complete (and single!) sentence.


FIGURE 1.20 The Big Idea worksheet


Exercise 1.21: solicit feedback on your Big Idea

After you've crafted your Big Idea, the next critical step is to talk through it with 
someone else. 

Grab a partner, your completed Big Idea worksheet, and 10 minutes. If they aren't
familiar with the Big Idea concept, have them read through the relevant section
of SWD ahead of time, or simply tell them the three components (it should
articulate your point of view, convey what's at stake, and be a complete and
single sentence). Let your partner know that you want candid feedback that will
help you improve the overarching message you want to get across to your
audience. Have them ask you a ton of questions so they can understand what you
want to communicate and help you achieve clarity through the words that you use.

Read your Big Idea to your partner. From there, you can let the conversation take 
its natural course. If you're feeling stuck, refer to the following questions. 

• What is your overarching goal? What would success look like in this situation? 

• Who is your intended audience? 

• Is there any specialized language (words, terms, phrases, acronyms) that is
unfamiliar or should be defined?

• Is the action clear? 

• Have you framed what you want to happen from your perspective or from your 
audience's point of view? If the former, how could you reframe for the latter? 

• What is at stake? Will this be compelling for your audience? If not, how can 
you change it? "So what?" is always a good question to ask related to this— 
why should your audience care? What matters to them? 

• Are there other words or phrases that may enable you to more easily get your 
point across? 

• Can your partner repeat your main message back to you in their own words? 

• "Why?" is probably the best question your partner can ask—and continue to ask 
to get you to articulate your logic in a way that will help you refine your Big Idea. 

Revise your Big Idea during or following the conversation with your partner. If
anything still feels rough or unclear, or if you'd simply like an additional
perspective, repeat this exercise with someone new.

See Exercise 9.7 in Chapter 9 for a facilitator's guide on how to run a formal Big Idea 
group session. Next, let's talk about how you can use the Big Idea on team projects.


Exercise 1.22: create the Big Idea as a team

Are you working on a project as part of a team? Here's a great exercise to undertake 
to make sure everyone is aligned and working towards the same overarching goal. 

1. Give each person a copy of the Big Idea worksheet (download from 
storytellingwithdata.com/letspractice/bigidea) and have them individually 
work their way through it and craft their Big Idea with the given project in mind. 

2. Book a room with a whiteboard or start a shared document and write out 
each of the Big Ideas. Ask each person to read theirs aloud. 

3. Discuss. Where are there commonalities across the various statements? Is
there anyone who seems out of alignment? What words or phrases best capture
the essence of what you want to communicate?

4. Create a master Big Idea, pulling pieces from the individual ones and further 
augmenting and refining as needed. 

This exercise helps ensure that everyone is on the same page and creates buy-in 
as people see components of their Big Idea flowing into the master Big Idea. It 
can also spark some awesome conversations that help everyone become clear on 
and confident about what needs to happen. 


Exercise 1.23: get the ideas out of your head! 

Let's put into practice a good first step in the storyboarding process:
brainstorming. Consider a project where you need to create an explanatory
communication like a slide deck. Get a stack of sticky notes and a pen. Find a
quiet workspace with a large empty table or whiteboard. Set a timer for 10
minutes. Start the timer and see how many ideas you can get out of your head and
onto stickies. You can imagine that each small square reflects a piece of
potential content in your eventual deck. That said, don't filter your thoughts;
rather, let it be a cathartic process (there are no bad ideas during this step).
Don't worry about order or how the ideas fit together at this point in the
process. Simply see how many sticky notes you can fill up in a set amount of
time.

Tip: do this low-tech exercise after you've spent enough time with the data to know 
what you want to communicate with it, but before you start creating content with 
your computer. This exercise is best done after you've created, solicited feedback 
on, and refined the Big Idea for the given project (see Exercises 1.20 and 1.21). 

If you find you're still on a roll writing down ideas when your timer goes off, feel free 
to add more time. After you've completed this exercise, move on to Exercise 1.24.


Exercise 1.24: organize your ideas in a storyboard

You've completed Exercise 1.23 (the ideas are out of your head and in writing on
sticky notes); now it's time to organize them. Step back and think about what
overarching structure can help you tie everything together in a way that will
make sense to someone else. It may be helpful to make additional stickies for
meta topics or themes as you organize your ideas. Where can you group things
together? What might you eliminate?

Speaking of eliminating, start a discard pile. For each sticky note you consider, ask 
yourself: does this help me get my Big Idea across? If you can't come up with a 
good reason to include it, move it to the discard pile. 

Here are some specific questions to contemplate as you're determining what order 
could work best for your situation: 

• How will you present to your audience: are you there live, over the phone or 
through a webinar, or sending something out that will be consumed on its own? 

• What order will work well for getting your content across to your audience? 
Does it make sense to start with the action you want from them, build up to it, 
or something in between? 

• What context is essential? Does your audience need to know it up front, or 
does it better fit later? How quickly should you answer "So what?" 

• Do you already have established credibility with your audience, or do you 
need to build it? If so, how will you do that? 

• Were there assumptions made in your work? When and how should you introduce
those? What if your assumptions are wrong? Does that materially change
the message?

• Do you need input from your audience? How and where can you best get that? 

• At what point does data fit in? Does the data confirm expectations or run counter 
to them? What data or examples will you integrate and where? 

• How can you best create common ground, get buy-in, and prompt action? 

There is no single right path, but the answers to the preceding questions will help 
you think through different options that could work well for your given circumstance. 
If there is data or other content that doesn't affect your message but that you
can't bear to get rid of, push it to the background: either physically, by putting
it later in the document (perhaps in the appendix), or visually, by de-emphasizing
it and putting emphasis on the most important components of what you need to
communicate.

We'll look at additional strategies for ordering content when we discuss Story in Chapter 6.
In the meantime, move on to Exercise 1.25 and get feedback on your storyboard.


Exercise 1.25: solicit feedback on your storyboard

After creating your storyboard, talk through it with someone else. There are a 
couple of benefits to this. First, simply talking through it can be helpful. Doing so 
forces you to articulate your thought process, which can help illuminate alternate 
approaches. Second, sharing with someone else may introduce new perspectives 
or ideas that help you improve your storyboard. 

This can be free-form: create your storyboard and then simply talk through it with 
a partner. Let the questions and conversation take their natural course. If you're 
feeling stuck, or don't have a partner handy and want to simulate this exercise, ask 
yourself the following questions: 

• How are you presenting to your audience? Are you creating something they 
will consume on their own, or will you (or someone else) be presenting the 
material? 

• Do the overall order and flow make sense? 

• What is your Big Idea? Where will you introduce it? 

• Does your audience care about all of these pieces? 

• If there are pieces your audience cares less about but you still need to include, 
how can you keep their attention during this part? 

• Where could things go wrong? How can you prepare for that? 

• How will you transition from one topic or idea to the next? 

• Is there anything that could be cut? Added? Rearranged? 

If it makes sense to get stakeholder or manager feedback at this point, do it! This 
can be a great early check-in point to get confirmation that you're on the right 
track or redirect your efforts—before you've invested a ton of time. 

When you spend time up front identifying and getting to know your audience, 
crafting the main message, and storyboarding content, you emerge with a plan 
of attack. This both reduces iterations and helps you be more targeted with your 
communications. The resulting materials are typically shorter than they otherwise 
would have been. This leaves you more time to create quality content: slides and 
graphs. We'll turn our attention there in the next chapter.


Exercise 1.26: let's discuss

Consider the following questions related to Chapter 1 lessons and exercises.
Discuss with a partner or group.

1. What audiences do you communicate to regularly? What do the various
audiences have in common? How are they different? How can you take the
needs of your audience into account when communicating with data?

2. Do you face a mixed audience when communicating with data? What are the
main groups that make up this audience? Do you have to communicate to
all of them at once? Are there any ways to narrow for purposes of
communication? How can you set yourself up for success? Do others have related
experience or learnings to share?

3. Reflect on the Big Idea and the practice of distilling your message down to 
a single sentence. How did you find the related exercises in this chapter? In 
what situations does it make sense to take the time to craft the Big Idea in 
your work? Have you tried this on the job? Was it helpful? Did you encounter 
any challenges? 

4. Why are sticky notes good tools for storyboarding? Do you have other useful 
or recommended methods for planning content for your communications? 

5. What tip or exercise did you find most useful in SWD or this book on the 
planning process for communicating effectively? Which strategies have you 
employed? Were you successful? What learnings will you put into practice 
going forward? 

6. Was there anything covered in this section that didn't resonate or that you 
don't think will work in your team or organization? Why is that? Do others 
agree or disagree? 

7. Are there things you believe your work group or team should do differently 
related to the planning process for communicating? How can you make that 
happen? What challenges do you anticipate related to this and how can you 
overcome them? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback?


Once you've taken time to understand the context and planned your communication
in a low-tech fashion, as we practiced in Chapter 1, comes the question: when I
have some data I need to show, how do I do that in an effective way? This is the
topic we'll tackle next.

There is no single "right" answer when it comes to how to visualize data. Any data 
can be graphed countless different ways. Often, it takes iterating—looking at the 
data one way, looking at it another way, and perhaps even another—to discover a 
view that will help us create that magical "ah ha" moment of understanding that
graphs done well can deliver.

Speaking of iterating, we have some forthcoming exercises that will encourage 
you to do just that. Through the exercises in this chapter, we'll create and evaluate 
a number of different types of graphs, helping us understand both the advantages 
and limitations of different individual pictures of the data. Our go-tos will mainly 
be the usual suspects—lines and bars—but we'll look at some twists on graph 
types introduced in SWD as well. 

Let's practice choosing an effective visual! 

First, we'll review the main lessons from SWD Chapter 2.


Exercise 2.1: improve this table

Frequently, when we first aggregate our data, we put it into a table. Tables allow 
us to scan rows and columns, reading the data and comparing the numbers. Let's 
look at an example table and explore both how we can improve it and take things 
a step further to visualize the data it contains. 


Figure 2.1a shows the breakdown of new clients by tier for the recent year. Use 
this table to complete the following steps. 


New client tier share

Tier   # of Accounts   % Accounts   Revenue ($M)   % Revenue
A                 77        7.08%          $4.68         25%
A+                19        1.75%          $3.93         21%
B                338       31.07%          $5.98         32%
C                425       39.06%          $2.81         15%
D                 24        2.21%          $0.37          2%


FIGURE 2.1a Original table 

STEP 1: Review the data in Figure 2.1a. What observations can you make? Do 
you have to make any assumptions when interpreting this data? What questions 
do you have about this data? 

STEP 2: Consider the layout of the table in Figure 2.1a. Let's assume you've been 
told this information must be communicated in a table. Are there any changes you 
would make to the way the data is presented or the overall manner in which the 
table is designed? Download the data and create your improved table. 

STEP 3: Let's assume the main comparison you want to make is between how 
accounts are distributed across the tiers compared to how revenue is distributed— 
and that you have the freedom to make bigger changes (it's not required to be a 
table). How would you visualize this data? Create a graph in the tool of your choice.

Solution 2.1: improve this table

STEP 1: When I encounter this table, I start reading and scanning down columns
and across rows. In terms of specific observations, I might start by noticing that
the majority of accounts are in Tiers B and C, while Tiers A and A+, though they
don't make up a huge number (or percentage) of accounts, do make up a meaningful
amount of revenue. In terms of questions, I wonder if the tiers are in order:
I would think A+ belongs above A and am confused that they don't appear that
way in the table (perhaps due to alphabetical sorting?).
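To make the sorting suspicion concrete, here is a minimal Python sketch. It shows that a plain alphabetical sort reproduces the order seen in Figure 2.1a, and that an explicit ranking is one way to get A+ on top. The `TIER_RANK` mapping is an assumption made for illustration (A+ as the top tier, D as the lowest); the source does not define the tier hierarchy.

```python
# Tier labels in the order they appear in Figure 2.1a.
tiers = ["A", "A+", "B", "C", "D"]

# Plain alphabetical sorting puts "A" ahead of "A+" because "A" is a prefix of "A+".
alphabetical = sorted(["D", "A+", "C", "A", "B"])

# Hypothetical explicit ranking (assumption: A+ is the top tier, D the lowest).
TIER_RANK = {"A+": 0, "A": 1, "B": 2, "C": 3, "D": 4}


def sort_by_rank(labels):
    """Return tier labels ordered from the top tier down."""
    return sorted(labels, key=TIER_RANK.__getitem__)


print(alphabetical)         # same order as the original table
print(sort_by_rank(tiers))  # A+ first, then A, B, C, D
```

An explicit ranking like this keeps the order tied to meaning rather than to how the labels happen to be spelled.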

I wish there was a "Total" row at the bottom, because in the absence of this I find
myself wanting to add up numbers. In fact, it's when I start to do that that I notice
some bigger issues. The third column (% Accounts), which I assume means percent
of total accounts, sums to 81.16%. The final column (% Revenue), which I
assume means percent of total revenue, sums to 95%. So now I'm unsure whether
these really are percent of total or something else. If they are, then there must
be some "Other" or "Non-tier" category that I'd want to include in order to have
the full picture.
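The adding-up described above can be sketched in a few lines of Python. This is only an illustration using the values printed in Figure 2.1a; the variable names are my own. Note that summing the two-decimal percentages as printed gives roughly 81.17%, a hair above the 81.16% quoted, presumably because the quoted figure was computed from unrounded values. If the columns really are percent of total, the listed rows imply totals, and the shortfall is what an "Other" category would have to hold.

```python
# Rows as printed in Figure 2.1a:
# tier -> (accounts, % accounts, revenue in $M, % revenue)
rows = {
    "A": (77, 7.08, 4.68, 25),
    "A+": (19, 1.75, 3.93, 21),
    "B": (338, 31.07, 5.98, 32),
    "C": (425, 39.06, 2.81, 15),
    "D": (24, 2.21, 0.37, 2),
}

accounts = sum(r[0] for r in rows.values())
pct_accounts = sum(r[1] for r in rows.values())
revenue = sum(r[2] for r in rows.values())
pct_revenue = sum(r[3] for r in rows.values())

# If the percentages are "percent of total", the listed rows imply a total.
implied_accounts = round(accounts / (pct_accounts / 100))
implied_revenue = round(revenue / (pct_revenue / 100), 1)

print(f"% Accounts sums to {pct_accounts:.2f}%, % Revenue sums to {pct_revenue}%")
print(f"Implied totals: {implied_accounts} accounts, ${implied_revenue}M")
print(f"Unlisted: {implied_accounts - accounts} accounts, "
      f"${implied_revenue - revenue:.1f}M")
```

The implied totals and the unlisted remainder line up with the "All other" and "TOTAL" rows that appear in the improved table, Figure 2.1b.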

When I focus on the numbers themselves, two decimal places seem like a lot for
the % Accounts column given the scale of the numbers. When showing data like this,
you should be thoughtful about the appropriate level of detail. There isn't
necessarily a single "right" answer, but you want to avoid too many digits. They
can make the numbers themselves harder to interpret and recall and may convey a
false sense of accuracy. Is the difference between 7.08% and 7.09% meaningful? If
not, we can drop a digit by rounding. Here, given the scale of the numbers and
differences between them, I would round to whole numbers across all except the
fourth column depicting revenue. There we are already summarizing in millions and
it seems like we would lose important differences between the dollar volumes by
rounding to a whole number, so there I'd round to one digit past the decimal point.
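As a small illustration of that rounding choice, the following Python sketch formats the Figure 2.1a values with whole-number percentages and revenue to one decimal place. The `display_row` helper and the column widths are hypothetical, chosen only for this example.

```python
# Values as printed in Figure 2.1a: (tier, % accounts, revenue in $M).
rows = [
    ("A+", 1.75, 3.93),
    ("A", 7.08, 4.68),
    ("B", 31.07, 5.98),
    ("C", 39.06, 2.81),
    ("D", 2.21, 0.37),
]


def display_row(tier, pct_accounts, revenue_m):
    """Format one row: whole-number percent, revenue to one decimal place."""
    revenue_text = "$" + format(revenue_m, ".1f")
    return f"{tier:<3}{pct_accounts:>5.0f}%{revenue_text:>7}"


formatted = [display_row(*row) for row in rows]
for line in formatted:
    print(line)
```

The formatted values match the ones shown in the improved table (Figure 2.1b), for example 7% and $4.7 for Tier A.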

Figure 2.1b is an improved table that addresses the preceding points.


New client tier share

Tier        # of Accounts   % Accounts   Revenue ($M)   % Revenue
A+                     19           2%           $3.9         21%
A                      77           7%           $4.7         25%
B                     338          31%           $6.0         32%
C                     425          39%           $2.8         15%
D                      24           2%           $0.4          2%
All other             205          19%           $0.9          5%
TOTAL               1,088         100%          $18.7        100%

FIGURE 2.1b Slightly improved table

STEP 2: There are additional improvements I can make to this table. When tables
are designed well, the actual design fades to the background so that we focus on
the numbers in a way that makes sense. I recommend against shading every other
row and instead am an advocate for white space (and limited light borders) to set
apart columns and rows as needed. Speaking of white space, I typically avoid
center-aligned text in graphs (because it creates hanging text and jagged edges that
look messy) in favor of left- or right-aligning text. In the case of tables, however,
I do sometimes opt for center alignment because of the separation this creates
between columns (another common practice in tables is to right-align numbers or
align by decimal point, which allows you to easily eyeball relative size). I can group
the accounts-related columns and revenue-related columns with a single title (and
under that, number and percent), which will reduce some redundancy of titles and
also give me more space to be specific about what the columns represent. Doing
so also allows me to make the columns narrower so the table overall takes up less
space. These are some specific tips. I'll also put forth a couple of more general
ones: consider the zigzagging "z" and where your eyes are drawn.

Consider the zigzagging "z": Without other visual cues, your audience will typically
start at the top left of your visual (for example, your table) and do zigzagging
"z's" across to take in the information. When we think about applying this to how 
we design our tables, it means you want to put the most important data at the top 
and at the left—when you can do so in the context of the overall data in a way that 
makes sense. In other words, if there are super-categories or data that needs to 
be taken into account together, keep them in the order that makes sense. In this 
particular example, I'd sort my tiers starting with the top (that is indeed A+) and 
decreasing as we move down the table. Going left to right, I'm happy enough with 
the way it is structured. I want to keep the distribution of accounts and percent of 
accounts next to each other since those relate to each other. If revenue were more 
important than accounts, I could move the two revenue columns leftwards, but I 
can also use other ways to focus attention there. Let's discuss that next. 

Where are your eyes drawn? Similarly to how we focus attention thoughtfully in 
graphs as part of explanatory analysis (something we'll explore in detail in Chapter 
4), we can also focus our audience's attention in tabular data to establish hierarchy 
of information. This can be especially useful in instances where you can't put the 
most important stuff leftward or at the top (because other constraints dictate the 
ordering). Despite this, you can still indicate relative importance to your viewers. 
Look back to Figure 2.1b: where are your eyes drawn? Mine go to the very first 
row where the column titles are Tier, # of Accounts, and so on. This isn't even the 
data! Rather than use up ink and draw attention there, I can be conscious about 
where I want to direct attention in the data and take intentional steps to get my 
audience to look there. This can be done through sparing use of color or by outlining
a specific cell or column or row. Adding visual aspects to some of the data
in the table is another way to draw attention there: colors and pictures grab our
attention when they are used judiciously.

If we assume the primary comparison we'd like our audience to make is between
the distribution of % Accounts compared to % Revenue, I could apply heatmapping
(using relative intensity of color to indicate relative value) to just those two
columns. See Figure 2.1c.


New client tier share

            ACCOUNTS            REVENUE
TIER            #   % OF TOT       $M   % OF TOT
A+             19         2%     $3.9        21%
A              77         7%     $4.7        25%
B             338        31%     $6.0        32%
C             425        39%     $2.8        15%
D              24         2%     $0.4         2%
All other     205        19%     $0.9         5%
TOTAL       1,088       100%    $18.7       100%

(Cells in the two % OF TOT columns are shaded, with more intense color for larger values.)

FIGURE 2.1c Table with heatmapping


As another approach, I could embed horizontal bar charts in place of the heatmapping.
See Figure 2.1d. This does work quite well to direct attention to those
columns and allows us to see how the shape of the distribution varies across the
two. However, the specific comparison between % Accounts and % Revenue for a
given tier is harder, since these bars aren't aligned to a common baseline. Tip: If
you are working in Excel, conditional formatting is available that will allow you to
create heatmapping or embedded bars in a table with ease.
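For readers working outside Excel, the idea of bars embedded in table cells can be approximated in plain text. The sketch below is an illustration under my own assumptions, not the method used for the figure: each block character stands for about two percentage points (rounded down, with a minimum of one block), and `BAR_WIDTH` is a hypothetical cell width.

```python
# Percent-of-total values from Figure 2.1b: tier -> (% accounts, % revenue).
shares = {
    "A+": (2, 21),
    "A": (7, 25),
    "B": (31, 32),
    "C": (39, 15),
    "D": (2, 2),
    "All other": (19, 5),
}

BAR_WIDTH = 20  # hypothetical cell width in characters


def bar(pct):
    """Return a left-aligned text bar: one block per two percentage points."""
    filled = max(1, pct // 2)
    return ("█" * filled).ljust(BAR_WIDTH)


print(f"{'TIER':<10} {'% ACCOUNTS':<{BAR_WIDTH + 5}} % REVENUE")
for tier, (acct, rev) in shares.items():
    print(f"{tier:<10} {acct:>3}% {bar(acct)} {rev:>3}% {bar(rev)}")
```

Each column of bars starts at its own left edge, so the shape of each distribution is easy to see, while comparing accounts to revenue within a tier still means looking across two separate baselines.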


New client tier share

[Table: TIER; ACCOUNTS (#, % OF TOT); REVENUE ($M, % OF TOT), with horizontal bars embedded in the two % OF TOT columns]

FIGURE 2.1d Table with embedded bars


STEP 3: Let's take it a step further and focus on the data that is in the bars in
Figure 2.1d and review some different ways we could graph it. When I hear a term like
"percent of total," it makes me think of parts of a whole, which might cause us to
look to the pie chart. In this case, since we are interested in both % of Accounts and
% of Revenue, we could depict this with a pair of pies. See Figure 2.1e.

New client tier share

[Two pie charts side by side: % of Total Accounts and % of Total Revenue]

FIGURE 2.1e A pair of pies

I'm not a big fan of pies. I sometimes joke that there's one thing worse than a
single pie: two pies!

Let me back up, though, and say that pies can work well if we want to make the 
point that one piece of the whole is very small, or another piece of the whole is 
very big. The challenge for me is that pies break down pretty quickly if we want to 
say anything more nuanced than that. This is because our eyes' ability to accurately
measure and compare areas is limited, so when the segments are similar in size,
it is difficult for us to assess which is bigger or by how much. If that's a comparison 
that is important, we'll want to represent it differently. 

In this instance, the primary comparison we want our audience to make is between
the various segments in the pie on the left and those in the pie on the
right. This is difficult for two reasons: the area challenge mentioned above and
the spatial separation between pies. This is further compounded by the fact that
the segments are in different places on the right as a result of how the data differs
between the breakdown on the left compared to the right. Basically, if any of the
data is different between the pies (which it should be if we have something interesting
to say about it!) then all the pieces are in different places across the two
pies, making them hard to compare. In general, you want to identify the primary
comparison you want your audience to make, put those things as physically
close together as possible, and align them to a common baseline to make that
comparison easy.

Let's start by aligning each measure to its own baseline, with a view similar to the
bars embedded in the table previously. See Figure 2.1f.

New client tier share

[Two side-by-side horizontal bar charts by tier (A+ through All other): % of Total Accounts and % of Total Revenue]

FIGURE 2.1f Two horizontal bar charts

In Figure 2.1f, it's very easy for us to compare the % of Total Accounts across tiers.
It's also easy to compare the % of Total Revenue across tiers. I can attempt to
compare accounts to revenue, but this is harder because they aren't aligned to a
common baseline. If I want to allow for that as well, then I could pull both of these
series into a single graph. See Figure 2.1g.

New client tier share

[Horizontal bar chart: TIER on the vertical axis; paired bars for % OF TOTAL ACCOUNTS vs. REVENUE on a shared horizontal axis from 0% to 40%]

FIGURE 2.1g Horizontal dual series bar chart

With the arrangement in Figure 2.1g, the easiest comparison for me to make
is, for a given tier, the % of Total Accounts compared to the % of Total Revenue.
These elements are both the closest together and they are aligned to a common
baseline. Bingo!

We could also flip this graph on its side into a vertical bar chart, or column chart.
See Figure 2.1h.


New client tier share

[Vertical bar chart: paired bars for % OF TOTAL ACCOUNTS vs. REVENUE by TIER (A+, A, B, C, D, All other), with the y-axis in percent]

FIGURE 2.1h A vertical bar chart

When we depict data in this manner, the primary comparison our eyes are making 
is the endpoints of the paired bars relative to each other and to the baseline. Let's 
draw some lines to further highlight this comparison. See Figure 2.1i.

New client tier share

[Vertical bar chart of % OF TOTAL ACCOUNTS vs. REVENUE by TIER, with a line drawn between the tops of each tier's paired bars]

FIGURE 2.1i Let's draw some lines



Now that we've drawn the lines, we don't need the bars anymore. I've removed 
those in Figure 2.1j. 


New client tier share

[Line chart: % OF TOTAL ACCOUNTS vs. REVENUE, one line per tier, with the bars removed]

FIGURE 2.1j Take away the bars

Next, I'll collapse all of these lines and label everything directly. This yields the 
slopegraph shown in Figure 2.1k. 

New client tier share

[Slopegraph: one line per TIER running from % OF TOTAL ACCOUNTS (left) to % OF TOTAL REVENUE (right); right-hand labels read 32% B, 25% A, 21% A+, 15% C, 5% All other]

FIGURE 2.1k A slopegraph

Slopegraph is really just a fancy word for a line graph that only has two points in
it. By drawing lines between the % of Total Accounts and % of Total Revenue for a
given tier, we can quickly see where the two measures differ. Revenue as a proportion
of total is quite a lot lower for Tier C and All other (indicated by lines sloping
downwards), while revenue as a proportion of total is much higher for tiers A+ and
A. In other words, though A+ and A make up a very small proportion of accounts
(9% combined), together they account for nearly 50% of revenue!
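The reading of the slopes above can be checked with a short Python sketch. It uses the rounded percentages from Figure 2.1b; the "slope" here is simply revenue share minus account share in percentage points, a simplification I am assuming for illustration.

```python
# Percent of total from Figure 2.1b: tier -> (% of accounts, % of revenue).
shares = {
    "A+": (2, 21),
    "A": (7, 25),
    "B": (31, 32),
    "C": (39, 15),
    "D": (2, 2),
    "All other": (19, 5),
}

# Positive slope: the line rises from accounts to revenue; negative: it falls.
slopes = {tier: rev - acct for tier, (acct, rev) in shares.items()}

for tier, change in slopes.items():
    direction = "up" if change > 0 else "down" if change < 0 else "flat"
    print(f"{tier:<10} {change:+3d} points ({direction})")

combined_accounts = shares["A+"][0] + shares["A"][0]
combined_revenue = shares["A+"][1] + shares["A"][1]
print(f"A+ and A combined: {combined_accounts}% of accounts, "
      f"{combined_revenue}% of revenue")
```

With these rounded values, A+ and A together come to 9% of accounts and 46% of revenue, which is the "nearly 50%" referred to in the text.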

We've looked at a number of ways to visualize this data. You likely made your own 
observations along the way about what worked well and what did not. What I've 
illustrated isn't exhaustive; I could have added a dot plot to the mix or calculated 
revenue per account and visualized that. That said, we don't typically have to 
go through every possible view of the data to find one that works. Perhaps both 
absolute values and percent of total are important, in which case the table might 
be the easiest way to show these different measures after all. If we can narrow our 
focus to a specific comparison or two, or a specific point we want to make, that 
will help us choose a way to show the data that will facilitate this. 
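As one example of the alternative views mentioned above, revenue per account can be derived directly from the table. This is a rough sketch using the values printed in Figure 2.1a; since revenue is given in millions to two decimals, the per-account figures are approximate.

```python
# From Figure 2.1a: tier -> (number of accounts, revenue in $M).
rows = {
    "A+": (19, 3.93),
    "A": (77, 4.68),
    "B": (338, 5.98),
    "C": (425, 2.81),
    "D": (24, 0.37),
}

# Revenue per account in thousands of dollars.
per_account_k = {tier: rev_m * 1000 / n for tier, (n, rev_m) in rows.items()}

ranking = sorted(per_account_k, key=per_account_k.get, reverse=True)
for tier in ranking:
    print(f"{tier:<3} ${per_account_k[tier]:,.0f}K per account")
```

Seen this way, the contrast between tiers is even starker than in the percent-of-total view, which could be a reason to graph this measure if value per account is the point you want to make.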

Any data can be graphed countless different ways. This exercise illustrates how
moving through different representations of our data allows us to more (or less)
easily see different things. Allow yourself time to iterate and complete the additional
exercises that will give you more practice at this important junction in the process!

Exercise 2.2: visualize!

Let's look at another table. The following shows the number of meals served each 
year as part of a corporate giving program. Spend a moment looking at the data. 
What is interesting about it? 


Meals served over time

Campaign Year   Meals Served
2010                  40,139
2011                 127,020
2012                 168,193
2013                 153,115
2014                 202,102
2015                 232,897
2016                 277,912
2017                 205,350
2018                 233,389
2019                 232,797

FIGURE 2.2a Table showing meals served over time


Notice how much work it is to process a column of numbers like this. We read data 
that is presented to us in tabular form, which—though this may seem like a simple 
way to show the numbers—actually takes a ton of brainpower! When I scan these 
numbers, I see the jump from 2010 to 2011, and another between 2013 and 2014.
You probably did, too. But if you're like me, it means you started at the top of the 
table and got there by scanning down the second column—comparing each new 
number to the one(s) before it. 
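The row-by-row comparison our brains perform here is easy to spell out in code. The following Python sketch computes the year-over-year change from the Figure 2.2a values and picks out the two largest increases; the variable names are my own.

```python
# Meals served by campaign year, from Figure 2.2a.
meals = {
    2010: 40139, 2011: 127020, 2012: 168193, 2013: 153115, 2014: 202102,
    2015: 232897, 2016: 277912, 2017: 205350, 2018: 233389, 2019: 232797,
}

years = sorted(meals)

# Change from each year to the next, keyed by (from_year, to_year).
changes = {(a, b): meals[b] - meals[a] for a, b in zip(years, years[1:])}

# The two largest increases.
biggest = sorted(changes, key=changes.get, reverse=True)[:2]

for pair, delta in changes.items():
    print(f"{pair[0]} to {pair[1]}: {delta:+,}")
print("Largest increases:", biggest)
```

The two largest increases are the same two jumps called out above. The output also shows a sizable drop from 2016 to 2017, which is easy to miss when scanning a column of numbers.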


Let's practice easing how hard our brains must work by making the data more 
visual. Download this data. Create the following visuals in the tool of your choice. 

STEP 1: Apply heatmapping to the second column of values.

STEP 2: Create a bar graph. 

STEP 3: Create a line graph. 

STEP 4: Choose: which of the visuals you've created do you like best? Are there 
any other ways you would graph this data?

Solution 2.2: visualize!

Pretty much anything we do to visualize the data originally shown in table form in 
Figure 2.2a is going to make it more quickly understandable. Let's check out a few 
ways we can ease the processing. 

STEP 1: First, let's apply some heatmapping. Most graphing applications have
built-in functionality that will allow you to do this with ease. You can pick colors
and choose how to apply them to the data. For example, I've created the following
in Excel by applying conditional formatting to the second column of values.
I indicated a 3-color scale, with lowest value white, 50th percentile light green
and maximum value green. In some situations, you could add a legend to make
it clear how to interpret the colors. In this case, I just want to give a general sense
that more intense color represents bigger values and vice versa. Eyeballing it, this
sense is intuitive given the numbers and relative intensity within the same hue.
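For those curious how a 3-color scale like this can work, here is a minimal Python sketch. It is an approximation of the idea, not Excel's exact algorithm: values are interpolated linearly from white at the minimum, through light green at the median (the 50th percentile), to green at the maximum. The RGB anchors are assumptions, since the source does not give exact colors.

```python
from statistics import median

# Meals served, 2010 through 2019, from Figure 2.2a.
meals = [40139, 127020, 168193, 153115, 202102,
         232897, 277912, 205350, 233389, 232797]

# Hypothetical RGB anchors for the three stops of the scale.
WHITE, LIGHT_GREEN, GREEN = (255, 255, 255), (198, 224, 180), (84, 130, 53)


def blend(c1, c2, t):
    """Linearly interpolate between two RGB colors, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))


def heat_color(value, values):
    """Map value onto a white -> light green -> green scale (min, median, max)."""
    lo, mid, hi = min(values), median(values), max(values)
    if value <= mid:
        return blend(WHITE, LIGHT_GREEN, (value - lo) / (mid - lo))
    return blend(LIGHT_GREEN, GREEN, (value - mid) / (hi - mid))


for year, value in zip(range(2010, 2020), meals):
    r, g, b = heat_color(value, meals)
    print(f"{year}  {value:>7,}  #{r:02x}{g:02x}{b:02x}")
```

Because the lowest value sits far below the rest, it is the only cell that comes out close to white, while the remaining years land in fairly similar shades of green.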


Meals served over time

Campaign Year   Meals Served
2010                  40,139
2011                 127,020
2012                 168,193
2013                 153,115
2014                 202,102
2015                 232,897
2016                 277,912
2017                 205,350
2018                 233,389
2019                 232,797

(Meals Served cells are shaded from white for the lowest value to green for the highest.)

FIGURE 2.2b Table with heatmapping

In Figure 2.2b, I'm perhaps more inclined to notice how much lower the number
of meals served in 2010 was (it's totally white): less than a third of the next
closest number! I can also quickly observe that 2016 had the greatest number of
meals served without having to read the numbers. The relative intensity of color
helps me more quickly interpret the relative quantitative values.

Related to this, I should point out that our eyes are pretty good at picking out big
differences in intensity, but we have a harder time with more minor differences. This 
means that if there is something interesting about all of those medium shades of 
green, that's a little harder to quickly grasp and I might want to find a way to more 
fully visualize the values. Let's do that next. 

STEP 2: Figure 2.2c shows a bar graph I could make based on this data. I chose
to keep the y-axis for reference. Almost instantly, we can get a general sense of
magnitude of the various bars. I thickened the bars from the default graph so
there is less of a gap between them, which makes it easier for my eyes to follow
along the tops of the bars and compare them to each other. I like the idea of bars.
We do have a continuous variable on the x-axis (time), but we can categorize it
into years, which may make sense if we want to focus on a specific year at a time
and have clear demarcation between the years.

Meals served over time

[Bar chart: number of meals served (y-axis, up to 300,000) by CAMPAIGN YEAR (x-axis, 2010 to 2019)]

FIGURE 2.2c Bar chart

STEP 3: We can also show this data as a line graph; see Figure 2.2d. In this iteration,
I decided to omit the y-axis and instead labeled only the beginning and
end data points. This makes it easy (and obvious) for my audience to compare the 
number of meals served in 2010 to 2019. The rest of the values would have to be 
visually estimated. If there are other values you thought your audience would be 
particularly interested in (for example, the high point in 2016), you could also add 
data markers and labels to those points specifically. 
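If your tool of choice is Python, a line graph along these lines might be sketched as follows. This is one possible approach, not how the figure in this solution was made (that was done in Excel). It assumes matplotlib is installed; the import sits inside `plot_line` so the small `endpoint_labels` helper works without it. The function names, output path, and styling choices are hypothetical.

```python
# Meals served by campaign year, from Figure 2.2a.
meals = {
    2010: 40139, 2011: 127020, 2012: 168193, 2013: 153115, 2014: 202102,
    2015: 232897, 2016: 277912, 2017: 205350, 2018: 233389, 2019: 232797,
}


def endpoint_labels(series):
    """Return {year: label} for only the first and last points of the series."""
    first, last = min(series), max(series)
    return {year: f"{series[year]:,}" for year in (first, last)}


def plot_line(series, path="meals_line.png"):
    """Draw the line without a y-axis and save it (assumes matplotlib is installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so no display is needed
    import matplotlib.pyplot as plt

    years = sorted(series)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(years, [series[y] for y in years], color="green", linewidth=2)
    # Label only the beginning and end data points.
    for year, text in endpoint_labels(series).items():
        ax.annotate(text, (year, series[year]), textcoords="offset points",
                    xytext=(0, 8), ha="center")
    ax.set_xticks(years)
    ax.set_yticks([])  # omit the y-axis; the labeled endpoints carry the scale
    for side in ("left", "top", "right"):
        ax.spines[side].set_visible(False)
    ax.set_title("Meals served over time", loc="left")
    ax.set_xlabel("CAMPAIGN YEAR")
    fig.savefig(path)
    plt.close(fig)


print(endpoint_labels(meals))
```

Calling `plot_line(meals)` writes the image to the given path. To label another point of interest, such as the 2016 high, you could extend `endpoint_labels` to include that year.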

When I remove the y-axis, I'll often use the subtitle space for the axis title. Here, 
given the graph title, you could probably argue that the subtitle is redundant and 
perhaps unnecessary. I'd rather be explicit so there is no question for my audience 
about what they are viewing. That said, another reasonable person might make a 
different decision. 

I've used green in the visuals in this exercise, mainly to make it clear that—while I 
often default to blue—blue certainly isn't our only choice when it comes to using 
color in our visuals. We'll talk more about color as part of the exercises in Chapter 4.

Meals served over time

[Line graph: # OF MEALS SERVED by CAMPAIGN YEAR, 2010 to 2019, with no y-axis; only the first and last data points are labeled]

FIGURE 2.2d Line graph

You inevitably made different design choices with your heatmap, bar graph, and 
line chart, and that's totally fine. The examples here and throughout are meant to 
be illustrative, not prescriptive. We'll look more specifically at aspects of design 
in Chapter 5. 

STEP 4: Which do I like best? When I look back over the visuals I created, I'm 
surprised at my own answer to this one. Going into it, I thought for sure I'd prefer
the line. It's the cleanest and it takes up the least amount of ink. But seeing
them together, and taking the limited context into account, I actually prefer the 
bar chart (Figure 2.2c). If there is a clear start and end to the program within each 
year, I'd provide this segmented picture. That said, I do think the overall trend is 
easier to see with the line graph. Additionally, if there were context that I wanted 
to annotate via text on the graph, I'd likely choose the line, which has more space 
to accommodate this. 

As we saw in Solution 2.1, this is another illustration that there is no single right
approach for visualizing data. Two different people faced with the same data visualization
challenge may opt for different approaches. Of utmost importance is
that we are clear about what we want to enable our audience to see and choose
a view that will help facilitate that.

Exercise 2.3: let's draw

One of the best tools we all have at our disposal when visualizing data is a blank 
piece of paper. If I'm ever feeling stuck or am looking for a creative solution, I get 
out a fresh sheet and start sketching. You don't need to be an artist to reap the important
benefits of drawing. When working on paper, we remove the constraints
of our tools (or what we know how to do with our tools). We are also less likely to 
form attachment to our work (the way we do after we've taken the time to create 
it with our computer). There's also simply something about empty space waiting 
to be filled that can help spark creativity. 

Let's do a quick practice exercise using this important instrument: paper. The following
graph (Figure 2.3a) shows capacity and demand measured in number of
project hours over time. It is currently graphed as a horizontal bar chart. But is this 
the only way to show this data? Certainly not! 

Get a blank piece of paper and set a timer for 10 minutes. How many different 
ways can you come up with to potentially visualize this data? Draw them! (Don't
worry about plotting every specific data point exactly; make it quick and dirty to
get an overall sense of what each visual could look like.) When the timer goes off,
look over your sketches. Which do you like best and why? 


Demand and Capacity by Month

[Horizontal bar chart: paired CAPACITY and DEMAND bars for each month, 2019-04 through 2019-12.
DEMAND data labels, in order of appearance: 46,193; 49,131; 50,124; 48,850; 47,602; 43,697; 41,058; 37,364; 34,364.
CAPACITY data labels, in order of appearance (seven of nine legible): 29,263; 28,037; 21,596; 25,895; 25,813; 22,427; 23,605.]

FIGURE 2.3a Let's draw this data!

Solution 2.3: let's draw! 

After 10 minutes, my paper is filled with six different ways to depict the data. See 
Figure 2.3b. 


[Six hand-drawn sketches of the capacity and demand data]

FIGURE 2.3b My data drawings

Starting at the top left, my initial sketch simply turns the horizontal bars upright 
so that time can move from left to right along the x-axis in a way that is intuitive. 
My second graph (top right) turns the bars into lines. I find it easier to focus on 
the gap with this view. But I wanted to do some more playing with bars of various 
forms, so my third iteration (middle left) goes back to those. I made Demand thinner
and behind Capacity, in hopes that this would make it clear how much we're
meeting out of the potential of what could be met. As a twist on that, we could
stack the bars, which I've done in the middle right picture. The stacked series becomes
the Unmet Demand (note that this stacked version only works if Demand
is always greater than or equal to Capacity; it would get tricky if Demand were
to fall below Capacity). My penultimate view (bottom left) re-envisions the bars as
dots and connects them to bring attention to the difference (this would still work 
if Demand falls below Capacity so long as we've made the Demand circles distinct 
from those representing Capacity, for example, by coloring them differently). My 
final illustration simply plots the trend of Unmet Demand. With this last one, we 
lose the context of the overall magnitude of Demand and Capacity, but depending 
on our goals, that may be okay. 
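The caveat about stacking can be expressed as a small check. The Python sketch below derives Unmet Demand as Demand minus Capacity and refuses to proceed when Demand falls below Capacity, since a stacked bar has no sensible way to show a negative segment. The function name is hypothetical and the numbers are made up for illustration; they are not the data from Figure 2.3a.

```python
def unmet_demand(demand, capacity):
    """Return demand minus capacity per period, for stacking on top of capacity.

    Raises ValueError if demand ever falls below capacity, because the
    stacked view cannot represent a negative segment.
    """
    if len(demand) != len(capacity):
        raise ValueError("demand and capacity must cover the same periods")
    gaps = [d - c for d, c in zip(demand, capacity)]
    if any(g < 0 for g in gaps):
        raise ValueError("demand falls below capacity; use lines or dots instead")
    return gaps


# Made-up project hours for three periods (illustrative only).
demand = [500, 480, 450]
capacity = [300, 310, 320]
unmet = unmet_demand(demand, capacity)
print(unmet)
```

The same list of gaps is what the final sketch plots on its own, at the cost of losing the overall magnitude of Demand and Capacity.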

When it comes to which I like best, I prefer the stacked bars (middle right) if Demand
is always higher than Capacity, as it is for the data we're graphing. That said,
I think any of these views could potentially work. There are definitely other ways
to show this data as well. Compare your drawings to mine. Did you come up with
any similar graphs? Where do our ideas differ? Which do you like best out of the
full group (yours plus mine)?

Let's continue working with this data and determine how we can make one of the 
sketches come to life in our tools! Move on to Exercise 2.4. 


Exercise 2.4: practice in your tool 

Consider the sketches created as part of Exercise 2.3, both the ones you drew
and the ones I sketched. Pick one (or more for extra credit!), download the data,
and create it in the tool of your choice.

Solution 2.4: practice in your tool

I'm an overachiever, so I created all of the views I drew by hand in Excel. See 
Figures 2.4a - 2.4f. 

Basic bars. First is the basic bar graph, or column chart. See Figure 2.4a. I've
intentionally filled in Capacity and left Demand as an outline to try to visually
differentiate what we're able to meet compared to the unmet demand. I don't love this
graph, and I think I like it less than I did in my drawn version. I appreciate the idea
of just having the outline for Demand, yet I find the outline plus the white space
between the bars visually jarring. I also feel this is the view out of all of them that
directs the least amount of attention to the gap between Capacity and Demand,
which seems like an important aspect of this data.

In this case I chose to use the subtitle space for my legend. I'll sometimes do this 
if there isn't an obvious place to label the data directly. As an alternative, I could 
try directly labeling the first or last set of bars and using those as my legend. 


Demand vs capacity over time

[Column chart: paired DEMAND (outlined) and CAPACITY (filled) bars over time, with the legend in the subtitle]

FIGURE 2.4a Basic bars


Line graph. The line graph is a cleaner design compared to the bars, because it
simply takes up less ink. I chose to label the lines (and also added data labels)
at the ends of the lines, eliminating any confusion over which series is which and
reducing the work of going back and forth between a legend and the data. I like
that the line allows us to focus on either Capacity or Demand, and it also makes
the comparison of them really easy, so we can see the gap between the lines and
quickly identify where it is growing and where it is shrinking. I bolded the Capacity
line so our attention would go there first, then see the context of the greater
Demand. See Figure 2.4b.


Demand vs capacity over time

[Line graph: Demand and Capacity over time, each line labeled at its end, with the Capacity line in bold]

FIGURE 2.4b Line graph


Overlapping bars. We go back to bars in Figure 2.4c, which shows an atypical
approach: making the bars overlap. I've made the Capacity series slightly transparent
so that it is clear that the Demand series starts at zero and isn't meant to
be interpreted as being stacked.

I like this iteration better than I anticipated when I sketched it on paper. That said,
I could imagine an audience might find it confusing or off-putting since it doesn't
look like a typical bar chart. If I wanted to use this graph, it would be a good one
to show to a couple people and get feedback to see whether others find it confusing
or if it could get the job done.

Demand vs capacity over time

[Column chart: overlapping Demand and Capacity bars, with the Capacity series slightly transparent]

FIGURE 2.4c Overlapping bars


Stacked bars. With stacked bars, I kept Capacity plotted at the baseline, but then 
changed the second series to Unmet Demand so it could be stacked on top. I 
switched my emphasis to Unmet Demand, making it blue and rendering Capacity 
in a light grey. I like this view. 


Demand vs capacity over time

[Stacked column chart: Capacity in light grey at the baseline, with Unmet Demand in blue stacked on top]

FIGURE 2.4d Stacked bars

Dot plot. This is another view that could catch my audience off guard. It feels
intuitive to me, but I have to recognize that pretty much any way I graph this data
is going to feel intuitive because I've spent time with the data: I know what it represents
and what I want my audience to take away. It may not be as obvious to them,
however. Again, soliciting feedback would be a good way to test and assess.

While I'm not sure I love this one, I am impressed at my own Excel wizardry used
to create it. The circles are actually data markers on two line graphs (one for Demand,
another for Capacity) where I've chosen not to show the actual line and
made the data markers huge so I'd have room to center the data label within
each point. The shaded region that connects the dots is Unmet Demand, which is
a stacked bar that sits on top of a second inclusion of the Capacity series (unfilled
so you don't see the bottom series in the stack). This is what I call brute force Excel
at its finest!

[Dot plot: Demand and Capacity shown as labeled circles for each period, connected by a shaded region representing Unmet Demand]

FIGURE 2.4e Dot plot

Graph the difference. My final view is a simple line graph that plots the Unmet 
Demand (Demand minus Capacity). This is my least favorite of all (or maybe I'd rate 
it as tied for last with the basic bars), as it feels like too much context is omitted 
when we go from the two data series to plotting the difference. See Figure 2.4f.

Unmet demand over time

[Line graph: Unmet Demand (Demand minus Capacity) over time]

FIGURE 2.4f Graph the difference

How did the visual(s) you made in your tool turn out? Which out of all of them do 
you like best and why? 

In the absence of any context, I'd choose the Stacked Bar in Figure 2.4d. I like that
it's easy to see both how Unmet Demand and Capacity are changing over time 
and I appreciate how this view makes it easy to focus attention on the decreasing 
Unmet Demand. 

We will revisit this data in the context of the broader dashboard it originated from in
Chapter 6.

Exercise 2.5: how would you show this data?

The following table shows attrition rate for a 1-year associate training program for 
a given company. Spend a moment familiarizing yourself with it, then answer the 
following questions. 


Year   Attrition Rate
2019             9.1%
2018             8.2%
2017             4.5%
2016            12.3%
2015             5.6%
2014            15.1%
2013             7.0%
2012             1.0%
2011             2.0%
2010             9.7%
AVG              7.5%

FIGURE 2.5a Attrition over time


QUESTION 1: How many different ways can you come up with to show this data? 
Draw or create in the tool of your choice. 


QUESTION 2: How would you show the average in the various views you've created? 


QUESTION 3: Which of the visuals you've created do you like best and why?

Solution 2.5: how would you show this data?

QUESTIONS 1 & 2: There are many potential ways we could show this data, depending
on our audience and our goals. I came up with six different potential visuals
and integrated the Average into each. Let's review and discuss each of these.

Simple text. Just because we have numbers doesn't mean we need a graph! In 
some situations we can simply communicate a number or two. For example, I 
could summarize all of this data by saying, "The attrition rate for this program 
has averaged 7.5% over the past ten years." This doesn't give us any sense of
the range over time, or basis for comparison, which in some cases would be
simplifying too much. If that's important, perhaps I could say something like, "The
attrition rate has varied from 1% to 15% over the past decade, and was 9.1% in 
2019." Or if I want to focus on more recent data, which may be more relevant, I 
could say, "The attrition rate for this program has increased in recent years, from 
4.5% in 2017 to 9.1% in 2019." 
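
The numbers in these summary sentences can be pulled straight from the table. What follows is a minimal Python sketch using only the standard library; the names `attrition` and `summarize` are illustrative assumptions rather than anything that ships with the exercise data.

```python
from decimal import Decimal, ROUND_HALF_UP

# Attrition rate (%) by year, copied from the table in Figure 2.5a.
attrition = {
    2010: Decimal("9.7"), 2011: Decimal("2.0"), 2012: Decimal("1.0"),
    2013: Decimal("7.0"), 2014: Decimal("15.1"), 2015: Decimal("5.6"),
    2016: Decimal("12.3"), 2017: Decimal("4.5"), 2018: Decimal("8.2"),
    2019: Decimal("9.1"),
}

def summarize(rates):
    """Return the average (one decimal, halves rounded up), the low, and the high."""
    mean = sum(rates.values()) / len(rates)
    average = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return average, min(rates.values()), max(rates.values())

average, low, high = summarize(attrition)
latest_year = max(attrition)

print(f"The attrition rate has averaged {average}% over the past {len(attrition)} years.")
print(f"It has varied from {low}% to {high}%, and was {attrition[latest_year]}% in {latest_year}.")
```

The exact mean of the ten values is 7.45%, so the sketch rounds halves up to reproduce the 7.5% shown in the table's AVG row.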

Each time you create a visual, come up with a sentence that answers the question, 
"So what?" (Exercises 6.2, 6.7, 6.11, 7.5, and 7.6 will ask you to do this explicitly.) 
You may find that you can communicate with that sentence, eliminating the need 
for the graph altogether. When you do have more data you need to communicate, 
consider what context is helpful and how you can visualize it. Let's look at some 
ways to graph the data next. 

Dot plot. I can use points to illustrate attrition rate (y-axis) by year (x-axis). I
incorporated the average by adding a line to the graph, which allows us to easily see
when we've been above and below average over time. See Figure 2.5b.

FIGURE 2.5b Dot plot

Line graph. Rather than plot the points, I could visualize the lines that connect
them so we can more easily see the trend over time. Figure 2.5c illustrates this. I 
retained the thin dotted line for the Average, but moved my labeling of it (and 
also abbreviated) so that it would better fit given this new layout of the data. I 
also chose to put a data marker and label on the final data point. This makes the 
comparison between the most recent point of data and the average an obvious 
one for my audience. 

FIGURE 2.5c Line graph (graph title: Attrition rate over time)

I tried a second iteration on the line graph, using shaded area for the Average, 
rather than a line. See Figure 2.5d. I prefer the original line view in Figure 2.5c, 
but I could envision scenarios with different data that might cause me to choose 
another approach.

FIGURE 2.5d Line graph with shaded area depicting average (graph title: Attrition rate over time)

Area graph. After trying out area for the Average, I decided to switch it up and 
depict the Attrition Rate with area and revert back to a line for the Average. See 
Figure 2.5e. I chose a lighter blue for the Average line so it would show up both 
against the empty white background as well as when it overlaps the area encoding
attrition rate. In each visual, I've labeled the Average differently; this is mainly
due to the space available and the shape of it. Alternate views of the data may 
cause you to make other design modifications like this as well. 

I don't love this one. It takes up a lot of ink for what we're trying to show and 
makes it seem like there's something important about the area under the curve, 
which isn't the case here. I don't use a lot of area graphs in general.

FIGURE 2.5e Area graph

Bar graph. Finally, I tried plotting this data as a bar chart. See Figure 2.5f. I
preserved the Average as a line, labeling it again differently from prior views given
the layout of the overall graph. 

FIGURE 2.5f Bar graph (graph title: Attrition rate over time)

QUESTION 3: Which do I like best? I'm happy with the preceding bar chart, but 
I like the line graph in Figure 2.5c best of all. Connecting the dots via lines makes 
it easy to see the trend in Attrition Rate over time. I can easily compare it to the 
Average. It doesn't use a lot of ink, which leaves me space to add commentary if 
it makes sense to do so.

Exercise 2.6: let's visualize the weather

I'm a big fan of bar charts. They are easy to read—our eyes and brains are great 
at comparing lengths when aligned to a common baseline, which is what the bar 
chart does for us. We take in the information in bar charts by comparing the
relative heights of the bars to each other and the baseline, so it's easy to see which
is biggest and by how much. Also the familiarity of bar charts can be useful when
communicating: since most people already know how to read them, they can
focus their brainpower on what to do with the data rather than try to figure out how
to read the graph. 

Let's take a look at an example bar chart. See Figure 2.6a, which shows the
weather forecast for the next six days measured by the expected daily high in degrees
Fahrenheit. 



FIGURE 2.6a The weather forecast 

QUESTION 1: Imagine you are preparing for a Sunday afternoon at the park. 
What temperature would you estimate for the high on Sunday? 

QUESTION 2: You are planning your children's clothes for the coming week and 
trying to decide what type of jacket or coat they'll need midweek. What
temperature might you estimate for the high on Wednesday?

QUESTION 3: What other observations can you make from this data?

Solution 2.6: visualize the weather

While the temperature may appear to be somewhere in the 90s on Sunday and in 
the 40s on Wednesday, that's not actually the case. Let's take a closer look. 

It turns out that Sunday is 74 degrees while Wednesday is 58 degrees. How is 
that possible? The initial graph in Figure 2.6a does not have a y-axis that starts at 
zero. Rather, it begins at 50. This distorts the data, making it so we can't accurately 
compare the temperature day to day. See Figure 2.6b, which adds both the y-axis 
and data labels to the original graph.

FIGURE 2.6b Bar charts must have a zero baseline!

Let's redesign the graph to start the y-axis at zero. Figure 2.6c shows the
side-by-side. Notice the difference this makes in interpreting the data.



FIGURE 2.6c Let's compare the two graphs

What looked like a large deviation from the average on the left of Figure 2.6c
looks comparatively a lot smaller on the right. With this view, you'd likely make 
a different decision when it comes to how thick the kids' coats should be on 
Wednesday! 
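
To put a number on the distortion, we can compare how tall the Sunday and Wednesday bars are drawn under each baseline. This is a small Python sketch, not part of the original exercise; the helper name `drawn_height` is an assumption made for illustration.

```python
def drawn_height(value, baseline):
    """Height of a bar as drawn when the y-axis starts at `baseline`."""
    return value - baseline

# Forecast highs in degrees Fahrenheit, as labeled in Figure 2.6b.
sunday, wednesday = 74, 58

for baseline in (50, 0):
    ratio = drawn_height(sunday, baseline) / drawn_height(wednesday, baseline)
    print(f"baseline {baseline}: Sunday's bar is {ratio:.2f}x as tall as Wednesday's")
```

With the axis starting at 50, Sunday's bar is drawn three times as tall as Wednesday's (24 versus 8 units), while the temperatures themselves differ by a factor of roughly 1.28. The zero baseline is what makes the bar heights agree with the data.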

There aren't a lot of hard-and-fast rules when it comes to visualizing data. But 
there are a few, and we've just witnessed one of them broken: bar charts must 
have a zero baseline. Because of the way our eyes compare the endpoints of the 
bars to each other and the baseline, we need the context of the full bar there in 
order to make that an accurate visual comparison. 

There are no exceptions. 

That said, this is not a rule that applies to all graphs. With bars, you can't chop or 
zoom because of the way we compare the ends of the bars relative to each other 
and the axis. But with points (scatterplots or dot plots) or lines (line graphs,
slopegraphs), we focus primarily on the relative positions of the points in space, and
in the case of line graphs, the relative slopes of the lines that connect the points. 
Mathematically, as we zoom, the relative positions and slopes remain constant. 
You still want to take context into account and avoid overzooming and making
minor changes or differences look like a big deal. Though sometimes minor changes
or differences are a big deal, so if you find yourself needing to change the axis to 
highlight this, reach for points or lines, not bars. 
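
One way to check the claim about zooming is to map a few values onto two different y-axis ranges and compare the steps between neighboring points. The sketch below is a hedged illustration in Python: `to_axis_position` and `step_ratio` are hypothetical helpers, and the third value (66) is made up so there are two steps to compare.

```python
def to_axis_position(value, axis_min, axis_max):
    """Map a data value to a 0-to-1 position along a y-axis."""
    return (value - axis_min) / (axis_max - axis_min)

# 74 and 58 come from the weather example; 66 is a made-up third value.
highs = [74, 58, 66]

def step_ratio(axis_min, axis_max):
    """Ratio of the first step (point 1 to 2) to the second step (point 2 to 3)."""
    a, b, c = (to_axis_position(v, axis_min, axis_max) for v in highs)
    return (b - a) / (c - b)

full = step_ratio(0, 80)     # y-axis starting at zero
zoomed = step_ratio(50, 80)  # y-axis zoomed in to start at 50

print(full, zoomed)
```

Both ratios come out the same (up to floating-point error), because changing the axis range stretches every step by the same factor. A bar's drawn height, measured from the baseline, does not have this property.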

On a related note, I've heard the idea raised that a zero baseline for weather 
doesn't make sense, since temperatures can be negative, and zero (particularly on 
a Fahrenheit scale) isn't meaningful. In the case of a short-term weather forecast, 
like we looked at here, the bars are fine so long as we do have a zero baseline
allowing us to compare the day by day expectations accurately. On the other hand,
if we take the case of climate change, for example, a couple degrees change 
in global temperatures—which is nearly impossible to see in bars with a zero 
baseline—is meaningful. This isn't a good argument for changing to a non-zero 
baseline bar chart, but rather for not using bars to illustrate this data. We could 
shift to a line graph or graph the change in temperature instead of absolutes to 
bring focus to the small but meaningful differences. As always, we should step 
back and think critically about what we want to show, then choose an appropriate 
visual to facilitate this.

Exercise 2.7: critique!

Speaking of points (mentioned in the solution to the previous exercise), let's take 
a look at some next, in the context of critiquing a less than ideal graph. 

See Figure 2.7a, which is a dot plot showing the bank index over time for a
number of national banks. Assume you work at Financial Savings.

FIGURE 2.7a Bank index


QUESTION 1: What questions do you have about this data? 

QUESTION 2: If you were designing the graph, what changes would you make? 
How would you visualize this data?

Solution 2.7: critique!

QUESTION 1: This graph seems to incite many more questions than it answers! 
My first question is: what exactly is the metric being plotted? I might assume 
"Bank Index" is some type of customer satisfaction score, and that the higher the
number, the better. But what if this is really something like bank teller errors? I'd 
interpret that data very differently. 

My next question is, do we need all of the data? We can see at the top that the 
red and yellow data points represent our company (Financial Savings) and the 
Industry Average, respectively (these also seem like odd color choices, though I 
guess they are bright in an attempt to stand out against all of the other color in 
this graph). I assume all of these dots roll up into the average (which is another 
question I have: is that the case?). This raises yet another question: do we need all
of those individual data points or would showing only Financial Savings and the 
Industry Average work? When you consider getting rid of data, you always want 
to think through what context you lose when doing so. Here, by summarizing with 
the average, we'd lose line of sight to the spread across competitors. Depending 
on our goals, this may or may not be important. 

In terms of other questions, I'm also curious what the red circle in 2019 is meant 
to highlight. I appreciate the thought process behind it: someone looked at this 
data and thought "I'd like you to look here" and drew a red circle. This presents a 
couple of challenges, however. First, there's so much competing for our attention 
in the graph with all the various colored dots that we might not even notice the 
red circle. Second, when we do notice it, it isn't immediately clear what it is trying 
to point out to us. 

My final questions are: So what? What does this data show us? What's the story? 

QUESTION 2: Let's shift from asking questions to redesigning how we show 
this data. It turns out the metric being graphed is branch satisfaction, where the 
higher the number, the better. I'll assume that we care most about how Financial 
Savings compares to the Industry Average. Simply making that decision means I 
can declutter this graph a ton and focus on the data points for Financial Savings 
and the Industry Average. 

Speaking of data points, this data is over time. We can plot it as points, but I'd 
be apt to connect the points and display this data in a line graph. Lines will help 
us more easily see the change over time and can also help highlight interesting 
things when it comes to how these lines interact with each other: if one is always 
above the other, the lines will help us see the gap. If that's not the case, lines will 
help us see when one series crosses the other, which will be interesting as we try 
to answer the question, "So what?"

Figure 2.7b shows my makeover of this visual.

FIGURE 2.7b Revamped graph


Decluttering and changing the graph to lines helps us focus on the data. I've titled 
and labeled everything directly, so there's no need to make assumptions or hunt 
around for how to interpret the data. I used the title space to answer the question, 
"So what?" 


With an additional understanding of what's driving the ups and downs in
satisfaction for our business and the industry, I could take this even further. In a live
meeting or presentation, I could build it line by line or time point by time point, 
which would allow me to focus my audience's attention as I talk through relevant 
context. If it needs to stand on its own, I could put text directly on the graph to 
annotate what's causing the changes we see. We'll look at a number of examples 
that employ these strategies as we get further along. We'll revisit this example 
and look at a scenario where we keep all of the original data in Chapter 4. 


Next, let's redesign another graph.

Exercise 2.8: what's wrong with this graph?

Sometimes, we design a graph with the best of intentions, but inadvertently make 
things difficult for our audience. Let's take a look at an example where this is the 
case and discuss how we can improve it. 

Continuing in the banking industry, next let's imagine you work as an analyst in 
consumer credit risk management. For those who may not be familiar—when 
people take out loans, some portion of those people don't repay them. These 
loans move through various levels of delinquency: 30-days past due, 60-days past 
due, and so on. Once they become 180 days past due, they are categorized as
"Non-Performing Loans." After reaching this stage of delinquency, despite
collection activities, many still don't repay and this results in a loss. Banks have to
reserve money for these potential losses. 

Now that we're past credit risk 101, let's talk about the data. You've been asked to 
create a graph showing how Non-Performing Loan (NPL) volume compares to the 
Loan Loss Reserves over time. Look at Figure 2.8a. Take note of how your eyes 
move as you process this information. What is confusing about this graph? How 
would you improve it? 


FIGURE 2.8a What is confusing in this graph? (graph title: NPLs and Loan Loss Reserves)

Solution 2.8: what's wrong with this graph?

In considering how I process this data, I start out by doing a lot of back and forth 
between the bars, the lines, and the legend at the bottom of the graph to try to 
interpret the data. I scan the y-axes to see what those are. After some reading 
and thinking about it, I figured out that the lines—Loan Loss Reserves % and NPL 
Rate—are meant to be read against the secondary y-axis on the right-hand side of 
the graph. This means the bars—Loan Loss Reserves and NPLs—should be read 
against the primary y-axis on the left. This seems more difficult than it needs to be. 

Going back to the Loan Loss Reserves % and NPL Rate: I'm not sure what the 
denominator is. I might assume that it's the total loan portfolio—but I'd rather 
this be made clear so I don't have to assume! I'm also unsure these lines add any 
value—they aren't adding any new information. There may be additional context 
that would cause me to make a different decision, but in the absence of that 
I'm going to focus on the volume—the dollars—and not confuse things by also 
showing the rate. As a bonus, this decision will eliminate the secondary y-axis (I 
recommend against the use of dual y-axes in general; for alternate approaches to 
the secondary y-axis, refer to Chapter 2 in SWD). 

It's only after all this that I start paying attention to the x-axis and notice the
biggest issue: we have inconsistent time intervals. Upon first glance (and several
thereafter—perhaps you didn't even catch this problem), I started reading the 
x-axis to see it is in units of years and then assumed that continues to be the case 
as we move from left to right. When we read each label, however, we find that 
after 2018, the time interval changes to quarters, and after Q4 it seems we've 
broken out December on its own. This is not good! 

I can appreciate the thought process that presumably led to this. December is 
probably the most recent month. Showing years for historical context is helpful, 
but then it's nice to also show greater granularity (e.g. quarterly, monthly) for the 
more recent time periods. 

Sometimes inconsistent time intervals are a reality—we might be missing data or 
simply have something that occurs inconsistently over time. In that situation, we 
need to denote that visually and make it clear to our audience. The same bar or 
line shouldn't be used to represent a year and a quarter, as this can too easily lead 
to incorrect interpretation and false observations. 

We have a couple of options for overcoming this challenge. If we have all of the 
quarterly data, I'd be apt to just plot that. Bars will get messy simply because 
there would be so many of them, but given that we're getting rid of the two data 
series that were originally depicted by lines, we could swap the original bars and 
plot the volume with lines. Or if for some reason we don't want to or can't show 
the quarterly data, and leave it all in one graph, one option would be to space the
x-axis such that each year takes up the same width that the four quarters together
take up. If I did this, I would exclude year 2019 so there isn't redundancy between 
that and the four quarters that are broken out separately. 
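
The spacing idea can be sketched by giving each period a width proportional to how long it lasts. The following Python snippet is an illustration under stated assumptions: the period labels follow the scenario described here (years 2014 through 2018, then the four quarters of 2019), and `bar_layout` is a hypothetical helper, not a function from any charting tool.

```python
# Each period is (label, length in quarters). Year 2019 is left out on purpose
# so it is not shown twice alongside its own quarters.
periods = [(str(year), 4) for year in range(2014, 2019)]
periods += [(f"Q{q} 2019", 1) for q in range(1, 5)]

def bar_layout(periods):
    """Return (label, left_edge, width) with width proportional to duration."""
    layout, left = [], 0
    for label, length in periods:
        layout.append((label, left, length))
        left += length
    return layout

for label, left, width in bar_layout(periods):
    print(f"{label:>8}  starts at {left:>2}  width {width}")
```

Laid out this way, the four quarters of 2019 together occupy the same horizontal space as any single earlier year, so equal distances along the x-axis always mean equal amounts of time.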

As another alternative, we could split this data into two graphs: one to show the 
annual data for 2014 through 2019, and then a second to break out the
quarterly data just for 2019. This would allow me to title each explicitly and make the
difference in time components clear. I would also compress the quarterly data 
more than the annual data to help visually reinforce the shorter time periods. The 
makeover incorporating these changes is shown in Figure 2.8b.

FIGURE 2.8b An alternative view


In this case, I chose to label the data directly so we can compare the volume of 
NPLs to that of Reserves without the work of having to estimate it from an axis. 
I maintained two decimal places of significance so there wouldn't be instances 
where two points of different heights shared the same value (for example, the 
third and fourth points on the Reserves line would both round to $1.6M, which 
could cause confusion since the heights are visibly different) and so we can easily 
interpret the smaller but meaningful differences in the recent quarterly numbers. 
I used shading to tie the final data point in the first graph (2019) to the quarterly 
breakout for 2019 in the right graph. It is very important in showing the data 
like this to ensure that the y-axis minimum and maximum are set to be the same 
amount across the two graphs, so that the audience can compare the quarterly 
data points to the annual ones according to their relative height. 
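
Both of these checks, label precision and matching axis limits, are easy to automate. Here is a minimal Python sketch; the dollar values are hypothetical stand-ins (the real figures are not reproduced in the text), and `label_decimals` and `shared_limits` are names invented for this example.

```python
def label_decimals(values, max_decimals=4):
    """Smallest number of decimals at which distinct values get distinct labels."""
    distinct = set(values)
    for decimals in range(max_decimals + 1):
        labels = {f"{v:.{decimals}f}" for v in distinct}
        if len(labels) == len(distinct):
            return decimals
    return max_decimals

def shared_limits(*series):
    """One (min, max) pair to apply to the y-axis of every graph."""
    flat = [v for s in series for v in s]
    return min(flat), max(flat)

# Hypothetical Reserves volumes in $M, chosen so two points collide at one decimal.
annual_reserves = [1.21, 1.43, 1.57, 1.63, 1.52]
quarterly_reserves = [1.55, 1.54, 1.50, 1.52]

print(label_decimals(annual_reserves))
print(shared_limits(annual_reserves, quarterly_reserves))
```

In this made-up series, 1.57 and 1.63 would both be labeled 1.6 at one decimal place, so the function settles on two. In practice you would likely pad or round the shared limits before applying them to both graphs.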

With this visual, I've taken away a lot of things that were making us do work in 
the original. Instead of trying to understand the graph, we can focus on the data. 
I can see that the gap between NPLs and our Loan Loss Reserves has narrowed 
markedly over time. Both were increasing through 2018, but decreased in 2019.
2019 marks the first time that NPL volume exceeds the Loan Loss Reserves. On a
quarterly basis, this happened in Q3 and Q4. This seems pretty important, so we
should probably take some action! 

Now that you've practiced with me, it's time to tackle some additional examples 
on your own.

Exercise 2.9: let's draw


As illustrated in Exercise 2.3, some of our best tools for figuring out how to show 
our data are a blank piece of paper and pen or pencil. Let's practice using these 
important instruments! 

The following data shows the average time to close a deal (measured in days) for 
direct and indirect sales teams across four products for a given company. Spend 
a moment to familiarize yourself with this data. 

Get a blank piece of paper and set a timer for 10 minutes. How many different 
ways can you come up with to potentially visualize this data? Draw them! (Don't 
worry about plotting every specific data point exactly—quick and dirty to get an 
overall sense of what each visual could look like will suffice.) When the timer goes 
off, look over your sketches. Which do you like best and why? 


Average time to close deal (days)

Product    Direct Sales    Indirect Sales    Total Sales
A                    83               145            128
B                    54               131            127
C                    89               122            107
D                    90               129            118

FIGURE 2.9a Average time to close deal

As part of this, what assumptions are you making about this data? What additional 
context do you wish you had?

Exercise 2.10: practice in your tool

STEP 1: Refer back to the sketches you created as part of Exercise 2.9. Pick one (or 
more for extra credit!), download the data, and create in the tool of your choice. 

STEP 2: After creating your graph(s), pause and reflect upon the following. 

QUESTION 1: What was helpful about sketching? 

QUESTION 2: Did you find anything about the drawing process annoying or 
frustrating? 

QUESTION 3: Was creating a graph in your tool different after first sketching it? 

QUESTION 4: Can you envision using this approach (draw options first, then 
create in your tool) in the future? In what situations? 

Write a few sentences summarizing your thoughts. 


Exercise 2.11: improve this visual

Imagine you work for a regional health care center and want to assess the relative 
success of a recent flu vaccination education and administration program across 
your medical centers. 

You have a dashboard where related metrics are reported and your colleague 
pulled the following visual from it. Take a moment to study Figure 2.11 and
answer the following questions.

FIGURE 2.11 Original visual from dashboard (graph title: "Successful Opportunities by Center (FLU)")


QUESTION 1: How is the data sorted? How else could we sort it? In what
circumstances would you make a different decision about how to order the data?


QUESTION 2: There is currently a horizontal line to show the average. How do 
you feel about this? How else could you show the average? 


QUESTION 3: What if there were a target? How might you incorporate it?
Assume the target is 10%. How would you show this? Now assume the target is 25%.
Does this change what you would show or how you would show it? 


QUESTION 4: The graph contains a data table. Do you find this effective? What 
are the pros and cons of embedding a data table within a graph? Would you keep 
it or eliminate it in this case? 


QUESTION 5: The graph currently shows the proportion who received the
vaccination. What if you wanted to focus on the opportunity (the proportion who did
not receive the vaccination)? How could you visualize this?

QUESTION 6: How would you graph this data? Download it and create your 
ideal view in the tool of your choice.

Exercise 2.12: which graph would you choose?

Any set of data can be graphed many ways and varying views allow us to see 
different things. Let's look at a specific instance of numerous graphs plotting the 
same data. 

You are visualizing data from your employee survey and want to show how
employees responded this year compared to last year to the retention item "I plan
to be working here in one year." Figures 2.12a through 2.12d depict four different 
views of the exact same data. Spend some time examining each, then answer the 
following questions. 


OPTION A: pies

FIGURE 2.12a Pies

OPTION B: bars

FIGURE 2.12b Bars

OPTION C: divergent stacked bars

OPTION D: slopegraph

FIGURE 2.12d Slopegraph

QUESTION 1: What do you like about each graph? What can you easily see or 
compare? 

QUESTION 2: What is difficult about the given view? Are there limitations or 
other considerations of which to be aware? 

QUESTION 3: If you were tasked with communicating this data, which option 
would you choose and why? 

QUESTION 4: Grab a friend or colleague and talk through the various options 
together. Do they agree with your preferred view, or is there a preference for 
another? Did your discussion highlight anything interesting that you hadn't
previously considered?

Exercise 2.13: what's wrong with this graph?

Consider Figure 2.13, which shows response and completion rates for an email 
marketing campaign where email recipients were asked to complete a survey. 

STEP 1: List three things that are not ideal about this graph. What makes it
challenging?

STEP 2: For each of the three things you've listed, describe how you would
overcome the given challenge.

STEP 3: Download the data. Create your visual that puts into practice the
strategies you've outlined.

FIGURE 2.13 What's wrong with this graph?

Exercise 2.14: visualize & iterate

As we've seen through a number of examples so far, visualizing data, when done
well, can help us spark a magical "ah ha" moment of understanding in our audience.
But often it takes iterating (looking at the data numerous ways), both to better
understand the nuances of the data and what we want to highlight, as well as to figure
out a way that will work for our audience. Let's practice visualizing and iterating.

Imagine you work at a medical device company and are looking at data that shows 
patient-reported pain levels when a component of a certain device is turned on 
and when it is turned off. Figure 2.14 shows the data. 


Patient-reported pain

                DEVICE SETTING
PAIN LEVEL        ON      OFF
IMPROVED         58%      36%
UNCHANGED        32%      45%
WORSENED         10%      19%
TOTAL           100%     100%

FIGURE 2.14 Let's visualize & iterate


STEP 1: Make a list: how many potential methods can you come up with to
visualize this data? What different graphs might work? List as many as you can.

STEP 2: From the list you've made, create at least four different views of this data 
(draw or realize in the tool of your choice). 

STEP 3: Answer the following related questions: 


QUESTION 1: What do you like about each visual? What is easy to compare? 


QUESTION 2: What considerations or limitations should you take into
account with each?


QUESTION 3: Which view would you use if you were communicating this data?

Exercise 2.15: learn from examples

A great deal can be learned from the data visualizations that others create, both
the good and the not so good. When you see a nice graph, pause and reflect: 
what makes it effective? What can you learn from it that you can apply in your 
own work? When you see a not-so-good example, do the same thing: stop and 
examine what was not done well and how you can avoid similar issues in your own 
work. Let's practice learning from examples. 

Find a graph from the media that's done well and another that is less than ideal. 
Answer the following questions for each of these examples. 

QUESTION 1: What do you like about it? What makes it effective? Make a list! 

QUESTION 2: What do you not like about the example? What limits its
effectiveness? How would you approach it differently?

QUESTION 3: What learnings from this process can you generalize to guide your 
future work? 


Exercise 2.16: participate in #SWDchallenge 

One of the best ways to learn is to do. The #SWDchallenge is a monthly challenge 
where readers of our blog practice and apply data visualization and storytelling 
skills. You can take part, too! Think of it as a safe space to try something new: test 
out a new tool, technique, or approach. Everyone is encouraged to participate 
and all backgrounds, experience levels, and tools are welcome. 

We announce a new topic at storytellingwithdata.com at the beginning of each 
month. Participants have a set amount of time to find data and create and share 
their visual and related commentary. Historically, the focus has been on different 
graph types, but we sometimes change it up with a tip to try or a specific topic. This 
is meant to be a fun reason to flex your skills and share your work with others. 

All submissions received by the deadline are shared in a recap post later in 
the given month. The monthly challenges and recap posts are archived at 
storytellingwithdata.com/SWDchallenge. 

There are a number of exercises you can undertake related to this challenge. Visit 
storytellingwithdata.com/SWDchallenge and tackle one (or more!) of the following: 

• Participate! Take part in the live challenge by creating and sharing your 
work. Or choose a past one as inspiration to flex your data visualization
skills. This can be done on your own, with a partner, or as a small team.
Share your creations on social media, tagging #SWDchallenge. 

• Emulate! Pick a recap post from the archives and review submissions.
Select a visual you like and work to recreate it in the tool of your choice. Are
there any aspects you would tackle differently than the original author? 

• Critique! Select a recap post from the archives and examine submissions. 
Pick three you believe are effective and describe what is done well.
Consider how you could generalize the learnings to apply to your own work.
Pick three designs you believe are not ideal and reflect on the issues you 
see and how you could overcome them. What common challenges can 
you generalize from this that you can overcome in your own work? 

• Run your own challenge! Get a group of colleagues or friends together, 
pick a past challenge (or create your own), and run your own version: 
everyone has a set amount of time to find data and create their visual. 
Share them with each other. Get together to discuss, giving everyone 
the opportunity to share their creation and receive feedback from others. 
Examine what you learn from this process that you can apply to future 
work. See Exercise 9.4 for more on how this fun process can help
cultivate a feedback culture.

Exercise 2.17: draw it!

Consider a current project where you need to visualize data. Grab a blank piece of 
paper and a pen or pencil. Set a timer for 10 minutes and see how many ideas you 
can sketch out for how to potentially show your data. 

When the timer goes off, step back and take inventory of what you've created. Which 
view(s) do you like best? Why is that? 

Show your sketches to someone else. Explain to them what you want to communicate. 
Which visual(s) do they like best? Why is that? 

If you ever feel stuck or are looking for an innovative approach and having trouble 
coming up with one on your own, grab a conference room with a whiteboard and a 
creative colleague or two. Talk them through what you want to show. Start drawing, 
and redrawing. Debate as you mock up the different views: what works well? What is 
lacking? Which visual(s) are worthy of creating in your tools? Can you do it on your 
own, or if not, what or who can help you realize your ideas? 


Exercise 2.18: iterate in your tool 

Allowing yourself time and flexibility to iterate through different views of your data 
lets you both better understand the nuances and determine which way of showing 
the data might help you achieve that magical "ah ha" moment of understanding 
that you seek in your audience. 

Take some data you'd like to visualize. Open your favorite graphing tool and start 
creating different visuals. How many ways can you come up with to look at the data? 
Set a timer for 30 minutes and iterate to create different views of the data in your 
graphing application. 

When the timer goes off, assess for each: what are the pros and cons? What do you 
want to enable your audience to see? Which iteration(s) will facilitate this? If unsure, 
jump to Exercise 2.21, which provides some tips for soliciting feedback from others. 


Exercise 2.19: consider these questions 

When you create a graph, it's not surprising that it makes sense to you. You are familiar 
with the data, and you know what to look at and what is important. Don't assume 
this is necessarily true for your audience. After you've made your graph, ask the following 
questions to help determine whether further iteration is necessary. 

• What are you trying to show? What do you want to enable your audience 
to do with your data? Does the visual you've created facilitate this? What 
takeaways are easiest to see? What comparisons are easiest to make? 
What things are harder to do given the way you are showing the data? 

• How important is it? Is this a critical issue or merely something that people 
might find interesting? What are the stakes? Is it a scenario where 
quick and dirty is okay? What level of perfection is warranted? What level 
of accuracy is required? 

• Who is your audience? Is your audience familiar with the data you are 
presenting or is it new? Does it fit in with their preconceived notions, or 
might it challenge a held belief? Does your audience expect the data to 
be presented in a certain way? What are the pros and cons of following 
the norm in this situation compared to doing something new or unexpected? 
What questions will your audience have, and how can you anticipate 
and be prepared to seamlessly address them? 

• Is your audience familiar with the type of graph? Anytime we use something 
less familiar to our audience, we are introducing a hurdle: we either 
have to get them to listen to us long enough to tell them how to read 
the graph, or get them to spend enough time with it to figure it out on 
their own. If you're using something less familiar, have a good reason 
for it. Does this view let your audience easily see something that would 
otherwise be difficult, or create a new insight that isn't possible with 
more familiar ways of graphing the data? Consider also how much time 
you want to spend talking about the graph, and how much of your audience's 
brainpower you want them to spend trying to understand the 
graph versus what the data in the graph shows. 

• How are you presenting the information? Will you be there live in person to 
talk through the data, set context, and answer questions, or are you sending 
something around that has to be processed on its own? Especially in the 
case where you aren't there, you need to take intentional steps to make it 
clear to your audience what the graph represents, how to read it, and how 
you want them to process your data. 


Exercise 2.20: say it out loud 

After you've created your graph or your slide, practice talking through it out loud. 
If you'll be presenting the data in a live setting (a meeting or presentation), put it 
on the big screen and practice discussing it as you would in a meeting. Even in the 
instance when you will send it off for your audience to process the data on their 
own, there can be important benefits to talking through your graphs. 

First, set up how to read the graph, what it shows, and what each axis represents. 
Then talk through the data and what important observations can be made. What 
you say may reveal pointers on how to iterate. If you find yourself saying things 
like, "This isn't important" or "Ignore that," these are cues for elements you can 
push to the background (or in some instances, eliminate entirely). Similarly, when 
you hear how you direct attention when talking through the data, consider how 
you can achieve this visually through the way you design the graph. 

In the case where you will be presenting the data live, practicing out loud will also 
help make the ultimate delivery smoother. First, do this on your own. Once you 
feel good about that, practice talking through it with someone else and get their 
feedback. The next exercise (Exercise 2.21) provides more pointers for getting 
good graph feedback. 

Want to learn more benefits to saying it out loud? Listen to Episode 6 of the 
storytelling with data podcast (storytellingwithdata.com/podcast), which focuses on 
this topic. 


Exercise 2.21: solicit feedback 

You've created a graph and you think it's pretty awesome. The challenge is that 
you know your work better than probably anybody else and since you're the one 
who created the graph, of course it's going to make sense to you. But will it work 
for your audience? 

Or what about the scenario where you've iterated in your tool—you've created 
several different views of the data but aren't entirely sure which one will work best? 

In each of these cases, I recommend soliciting feedback from others. 

Create your visual or set of graphs and find a helpful friend or colleague. It can be 
someone without any context. Have them talk you through their thought process 
for taking in the information, including: 

• What do they pay attention to? 

• What questions do they have? 

• What observations do they make? 

This conversation can help you understand whether the visual you've created is 
serving its intended purpose, or if it isn't, give you pointers on where to concentrate 
your iterations. Ask questions. Discuss your design choices and talk about 
what is working effectively and what might not be as obvious to someone who is 
less close to the data. Seeking feedback from various sources can also be beneficial: 
think about when it would be helpful to get feedback from someone in a 
totally different role than your own. 

Also, watch initial facial responses: there is a microsecond that passes before people 
censor their physical reactions. If you see any furrowing of brows or pursing of 
lips (any general face-scrunching), these are micro-cues that something may not 
be working quite right. Pay attention to these cues and work to refine your visuals. 
If people are having a hard time with your graph, don't assume it's them. Consider 
what you can do to make the information easier to take in: perhaps you can more 
clearly title or label, use sparing color to focus attention, or choose a different 
graph type to get your point across more easily. 

You'll find additional guidance for giving and receiving effective feedback in 
Exercise 9.3. 


Exercise 2.22: build a data viz library 

Collect and build a library of the effective data visualization examples created and 
used at work. You can do this on your own, or this can be an excellent undertaking 
for a team or organization. Be thoughtful about how you organize the content for easy 
searchability (for example, by graph type, topic, or tool). Make files available to 
download so others can see the specifics of how they were made and modify them for 
use in their own work. You can also add effective examples that you encounter 
externally from the media, blogs, or #SWDchallenge. 

Make effective data visualization a team goal. To ensure continued focus, host a 
regular friendly competition, where individuals can nominate their own or their 
colleagues' examples of effective data visualization. Each month or quarter, choose 
winners and archive their work in the shared library. This can be a great ongoing 
source of inspiration: if someone is feeling stuck, they have something to turn to 
and flip through for possible ideas. It is also an excellent resource for new hires, 
so they have examples of effective data visualization in your work environment, 
helping set the right expectations for their own work. 


Exercise 2.23: explore additional resources 

There are many additional resources out there when it comes to choosing an 
effective graph or getting inspiration from other people's creations. Practicing, 
getting feedback, and iterating are keys to success. That said, here are a few chart 
choosers I'm aware of that you may find helpful when it comes to figuring out 
what graphs might work for your specific needs: 

• Chart Chooser (Juice Analytics, labs.juiceanalytics.com/chartchooser). Use 
their filters to find the right chart type for your needs, download as Excel 
or PowerPoint templates and insert your own data. 

• The Chartmaker Directory (Visualizing Data, chartmaker.visualisingdata.com). 
Explore the matrix of chart type by tool and click the circles to see 
solutions and examples. 

• Graphic Continuum (PolicyViz, policyviz.com/?s=graphic+continuum). The 
poster includes more than 90 graphic types grouped into six categories. 
Also check out the related Match It Game and Cards. 

• Interactive Chart Chooser (Depict Data Studio, depictdatastudio.com/charts). 
Explore the interactive chart chooser using filters. 

Check out the following collections to browse other people's work for inspiration. 
For each graph you encounter, pause to reflect on what works well (or not so well) 
and consider how you can use (or avoid!) similar aspects in your own work: 

• Information Is Beautiful Awards (informationisbeautifulawards.com). These 
annual awards celebrate excellence and beauty in data visualizations, 
infographics, interactives, and informative art. The archives contain 
hundreds of data visualizations. 

• Reddit: Data Is Beautiful (reddit.com/r/dataisbeautiful). A place for visual 
representations of data: graphs, charts, and maps. 

• Tableau Public Gallery (public.tableau.com/s/gallery). Stunning data 
visualization examples from across the web created with Tableau Public. In 
particular, check out the Greatest Hits Gallery using the drop-down menu. 

• The R Graph Gallery (r-graph-gallery.com). Looking for inspiration or 
help? Here you will find hundreds of distinctive graphics made with the 
R programming language, including code. 

• Xenographics (xeno.graphics). Xeno.graphics is a repository of novel, 
innovative, and experimental visualizations to help inspire, fight 
xenographphobia, and popularize new chart types. 


Exercise 2.24: let's discuss 

Consider the following questions related to Chapter 2 lessons and exercises. 
Discuss with a partner or group. 

1. How is the way that we process tables different from how we process graphs? 
What are the pros and cons of presenting data in tabular form? In what 
circumstances does it make sense to use a table? In what scenarios should you avoid 
a table? 

2. One common decision when graphing data is whether to have a y-axis that is 
titled and labeled or omit the axis and label the data directly. What 
considerations should you make when determining which is better for a given situation? 

3. When is it okay to have a non-zero baseline when graphing data? 

4. Why is paper a good tool for graphing data? Were the exercises in this chapter 
that asked you to draw helpful? Will you use this low-tech method in your work 
going forward? Why or why not? 

5. What is the purpose of graphing a given set of data multiple ways? Why is it 
important to iterate and look at different views of your data? When will you take the 
time to do this going forward? When does it not make sense to spend time on this? 

6. The examples in SWD and in this book are mostly basic charts: a lot of lines 
and bars. When does it make sense to use a graph that is more novel or less 
familiar? What are the pros and cons of using a graph that your audience may 
not have previously encountered? What steps can you take in this situation to 
help ensure success? 

7. Are there any cases where data has historically been graphed a certain way 
by your team or your organization that you believe should be changed? How 
might you drive this change? What sort of resistance or pushback do you 
anticipate? How can you address this? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback? 


Every element we put in our graphs or on the pages and slides that contain them 
adds cognitive burden: each one consumes brainpower to process. We should 
take a discerning look at the elements we allow into our visual communications 
and strip away those things that aren't adding enough informative value to make 
up for their presence. 

This lesson is simple but the impact is huge: get rid of the stuff that doesn't need 
to be there. We'll illustrate and experience the power of doing so through a handful 
of targeted exercises in this chapter. 

Let's practice identifying and eliminating clutter! 

First, we'll review the main lessons from SWD Chapter 3. 


identify & eliminate clutter: chapter summary 

FIRST, LET'S RECAP: CLUTTER is your ENEMY 

What is clutter? Visual elements that take up space and don't aid understanding. 

Cognitive load: the mental effort that's required to learn new information. Every 
element we put on a page or screen puts cognitive burden on our audience, so we 
should take care not to include things that aren't adding information. 

Lack of visual order (another type of clutter): leverage white space and align 
elements. Aim for clean horizontal and vertical elements; avoid diagonal. 

Nonstrategic use of contrast: clear contrast is a signal indicating where to look. 
Don't make too many things different or key points will get lost. 

Gestalt principles describe how we subconsciously order what we see in the world. 
We can use this understanding of how people see to help identify and eliminate 
clutter. The six principles are proximity, similarity, enclosure, closure, continuity, 
and connection. 


identify & eliminate clutter: chapter outline 

PRACTICE with COLE 

3.1 which Gestalt principles are in play? 
3.2 how can we tie words to the graph? 
3.3 harness alignment & white space 
3.4 declutter! 

PRACTICE on your OWN 

3.5 which Gestalt principles are in play? 
3.6 find an effective visual 
create alignment and use white space 
declutter (again!) 
3.10 declutter (some more!) 

PRACTICE at WORK 

start with a blank piece of paper 
3.12 do you NEED that? 
3.13 let's discuss 


We'll start by familiarizing ourselves with the Gestalt Principles of Visual Perception, 
then explore how we can use them to declutter and make our visual 
communications easier for our audience to process. 


Exercise 3.1: which Gestalt principles are in play? 

The Gestalt principles describe ways in which we subconsciously bring order to 
the things we see. SWD introduced six of these principles: proximity, similarity, 
enclosure, closure, continuity, and connection. We can use the Gestalt principles 
to make our visual communications easier for our audience to process by helping 
make the connections between the different elements we show more obvious. (If 
you aren't familiar with these principles and don't have SWD handy, we'll review 
them in detail through the solution to this exercise.) 

Consider the following visual, which illustrates actual and forecast market size 
(measured by total sales) over time for a class of pharmaceutical drugs. Which of 
the Gestalt principles mentioned above can you identify? Where and how is 
each used? 

[Figure 3.1 annotation text, as far as recoverable from the original graphic:] 

[...] month). There was a nearly 20% decrease 
in July, when Product X was recalled and 
pulled from the market. Total sales remained 
at reduced volume for the rest of the year. 

2019: The year started at less than $1.6B, but 
increased markedly in February, when a new 
study was released. Total sales have increased 
steadily since then and this is projected to continue. 
The latest forecast is for $2.4B in monthly sales by 
the end of the year. 

FIGURE 3.1 Which Gestalt principles are in play? 


Solution 3.1: which Gestalt principles are in play? 

I've made use of each of the six Gestalt principles in Figure 3.1. Let's briefly discuss. 

Proximity: Proximity is used in a number of ways. The physical closeness of the 
y-axis title and labels indicates to us that those elements are to be understood 
together. The close proximity of the data labels to the data markers makes it clear 
that those relate to each other. 

Similarity: Similarity of color (orange and blue) is used to visually tie sparing words 
in the text at the top to the data points in the graph that those words describe. 

Enclosure: The light grey shading on the right side of the graph employs the 
enclosure principle both to differentiate the forecast from the actual historical data 
and also to link that part of the line to the words at the bottom that lend additional 
detail. The lines between 2018 and 2019 on the x-axis also have an enclosing effect. 

Closure: The overall visual makes use of the closure principle. I didn't put a border 
around the graph. I didn't need to: the closure principle says we perceive a set of 
individual elements as a single, recognizable unit. So the graph appears as part of 
a whole. If we look at this on an element-by-element basis, this is true for each of 
the individual text boxes as well. 

Continuity: The dotted line depicting the forecast data on the right-hand side of 
the graph employs the continuity principle. This allows us to make this part of the 
line visually distinct, but still enables us to "see" it as a line. Because dotted lines 
themselves add clutter (since they are many dashes compared to a single solid 
line), I recommend reserving their use for when there is uncertainty to depict, as 
is the case with the forecast. 

Connection: The connection principle is used in the line graph itself, connecting 
all of the monthly data points and making the overall trend easier to see. Each 
axis employs this principle as well, visually connecting dollars on the y-axis and 
time on the x-axis. 
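
If you build a graph like this in code, the continuity idea from above (a solid line for actuals that carries on as a dotted line for the forecast) mostly comes down to how you split the series. The following is a minimal sketch under stated assumptions: the function name `split_actual_forecast` and the sales numbers are hypothetical and are not taken from Figure 3.1. The last actual point is repeated as the first forecast point so the two pieces join and still read as a single line.

```python
def split_actual_forecast(values, first_forecast_index):
    """Split a series into an actual segment and a forecast segment.

    Each segment is a list of (position, value) pairs. The last actual
    point is repeated at the start of the forecast segment so that the
    solid and dotted pieces meet and still read as one continuous line.
    """
    if not 0 < first_forecast_index < len(values):
        raise ValueError("first_forecast_index must fall inside the series")
    points = list(enumerate(values))
    actual = points[:first_forecast_index]
    forecast = points[first_forecast_index - 1:]
    return actual, forecast


# Hypothetical monthly sales in $B (made-up numbers for illustration only).
sales = [1.6, 1.9, 2.0, 2.1, 2.3, 2.4]
actual, forecast = split_actual_forecast(sales, first_forecast_index=4)
print(actual)    # segment to draw with a solid line
print(forecast)  # segment to draw with a dotted line
```

Each segment can then be drawn with its own line style in whatever tool you use; in matplotlib, for instance, a dotted style can be requested with `linestyle=":"`. Sharing the junction point is the design choice that matters here: without it, a gap appears between the last actual month and the first forecast month.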

There may be additional uses of the principles that I've not mentioned directly. 
How many of those that I've outlined above did you identify? How might you 
make use of similar strategies in the future? We'll look at additional applications of 
the Gestalt principles through the remaining exercises in this chapter and beyond. 

Expanding to other lessons covered in Chapter 3 of SWD, reflect on how the strategic 
use of contrast, alignment, and white space contributed to the effectiveness 
of the visual in Figure 3.1. Speaking of these design elements, we'll do an exercise 
looking at these more closely soon. But first, let's look at how we can use Gestalt 
principles to tie words to the data we show. 


Exercise 3.2: how can we tie words to the graph? 

When we communicate with data for explanatory purposes, often the final result 
is a slide deck, where each page contains both words and visuals. I frequently 
encounter client examples that have a graph on one side and words on the other 
or words at the top and a graph or two underneath. Often, both the words and 
visuals are important: the words help lend context or describe something and the 
graph helps us to see it. 

The challenge is that this often creates a lot of work for the audience. When we 
read the text, we are left on our own to search in the data for where we should be 
looking in the graph(s) for evidence of what is being said. We have to figure out 
for ourselves how the words relate to the graph and vice versa. 

Don't make your audience do this kind of work: do it for them! 

To help, we can use the Gestalt principles to visually tie the text to the data. Let's 
practice. Consider the following visual. Which Gestalt principles could we make 
use of to tie the words at the right to the graph at the left? List them and either 
describe or draw how you would make use of each. Which would you employ if 
you were communicating this data? 

FIGURE 3.2a How can we visually tie the words to the graph? 


Solution 3.2: how can we tie words to the graph? 

When the audience reads the text at the right, there are no visual cues to help 
them know where to look in the graph for evidence of what is being said. They 
have to read, think about it, and search in the graph. This is a straightforward 
example: if we spend some time, we can figure it out. But I don't want my audience 
to have to "figure it out." I want to identify the work that the visual in Figure 3.2a 
implicitly asks my audience to undertake and instead design my visual in a way 
that minimizes or eliminates that work, making things easy on my audience. Tying 
related things together through Gestalt principles allows me to do this. 

I will illustrate ideas for making use of four of the Gestalt principles to tie my data 
to the text: proximity, similarity, enclosure, and connection. Let's discuss each of 
these and take a look at how we can apply them. 

Proximity. I can put the text physically close to the data it describes. This tends to 
be a good approach as long as you can do so without interfering with the ability 
to read the data. See Figure 3.2b. 

FIGURE 3.2b Proximity 


Because of the close proximity of the text to the data it describes, this takes away 
some of the work. That said, we still have to make some assumptions or read the 
x-axis to orient ourselves to exactly which data points are being described. If we 
wanted to illustrate this more quickly, we might somehow make those individual 
data points distinct. See Figure 3.2c. 

FIGURE 3.2c Proximity with emphasis 

In Figure 3.2c, both the darker grey for the points of interest and the sparing bold 
in the text allow us to quickly understand as we read the various observations 
described in the text which data points illustrate these takeaways. When we put 
text directly on the graph, however, it can sometimes make it harder to see what's 
going on in the data. At other times, text directly on the graph can feel cluttered 
or you simply may not have the room for it. In those instances, look to one of the 
following solutions. 

Similarity. We can keep the text at the right, but employ similarity of color to tie 
the words to the graph. See Figure 3.2d. 

FIGURE 3.2d Similarity 


When I process the information in Figure 3.2d, my eyes do a lot of bouncing back 
and forth. I start at the top left, then my eyes scan to the right, pausing on the red 
bar and then over to the first block of text at the right and the red "April." Then I 
continue reading downward and encounter the orange "Summer," which prompts 
me to bounce leftward to the orange bars. Then finally, I pause on the blue bars 
and then read the text that describes them. This feels pretty natural to me and I 
employ this strategy frequently. Still, let's look at some other options. 
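
For readers who build their graphs in code, similarity of color can be sketched as a small lookup: every bar starts out grey, and only the bars a takeaway talks about receive that takeaway's color. This is a minimal sketch under stated assumptions, not the method used to produce Figure 3.2d. The helper name `bar_colors`, the grey hex value, and the exact month groupings (in particular, which months count as "Summer" and which bars are blue) are hypothetical choices made for illustration.

```python
DEFAULT_GREY = "#BFBFBF"  # assumed neutral color for bars that are not discussed


def bar_colors(labels, highlights, default=DEFAULT_GREY):
    """Return one color per bar label.

    `highlights` maps a color to the labels that should receive it.
    Any label not listed in `highlights` falls back to `default`.
    """
    color_for = {}
    for color, group in highlights.items():
        for label in group:
            color_for[label] = color
    return [color_for.get(label, default) for label in labels]


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical grouping: the month sets below are assumptions for illustration.
highlights = {
    "red": ["Apr"],
    "orange": ["Jul", "Aug"],
    "blue": ["Nov", "Dec"],
}

colors = bar_colors(months, highlights)
print(list(zip(months, colors)))
```

The resulting list can be handed to a bar-chart call that accepts one color per bar (matplotlib's `Axes.bar` takes such a list through its `color` argument), and the same color names can be applied to the matching words in the text so that words and data visually belong together.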

Enclosure. We can physically enclose the text with the data it describes. See 
Figure 3.2e. 

FIGURE 3.2e Enclosure 


In Figure 3.2e, the light shading is meant to connect the data points to the text. If 
the data were shaped differently, this method might not work so well. For example, 
if the September bar had value 0.8%, this would be confusing because it would 
cross both the first and second grey shaded areas, and could cause us to think we 
should relate it to one of those, when really none of the text is about that particular 
data point. 


Though I like how this looks, another drawback compared to the previous use of 
similarity of color is that we don't have any visual cues to help us talk about this 
data. If I will be presenting this graph live, it can be useful to be able to say things 
like "Look at the red bar, which shows..." or "The blue bars indicate where...". I 
could solve this by adding color to my shaded regions. See Figure 3.2f. 

FIGURE 3.2f Enclosure with color differentiation 


I could take this a step further and use similarity of color for sparing data and 
words in addition to the shaded regions to make it clear which text relates to 
which data points. See Figure 3.2g. 

FIGURE 3.2g Enclosure plus similarity 


Connection. Another way we could tie the words to the data is by physically 
connecting them. Figure 3.2h illustrates this. 

[Figure 3.2h text: graph title "2019 monthly voluntary attrition rate"; annotation 
"Attrition is typically low in November & December due to the holidays."] 

FIGURE 3.2h Connection 


This works well given the layout of the data and varying heights of bars. This 
method will look the cleanest when you have only a few lines and are able to 
orient them horizontally (diagonal lines look messy and are attention grabbing, 
so if you find yourself needing to use diagonal lines, I would recommend using 
similarity rather than connection). Notice that the lines themselves don't need to 
draw any attention—they can be thin and light, so they are there for reference, but 
aren't distracting from our data. 

There is still some processing I have to do, though, with the view in Figure 3.2h. 
I have to read the middle block of text to know that it applies to not only the August 
bar, but also to the July bar that precedes it. Similar processing is needed for 
December and the final takeaway. I could ease this work by layering on similarity 
of color. See Figure 3.2i. 

FIGURE 3.2i Connection plus similarity 


Figure 3.2i makes it clear through both connection and similarity which text relates 
to which data. 


In choosing among these options (and you may very well have come up with additional 
options), I tend to favor the simple similarity of color illustrated in Figure 
3.2d. The preceding view in Figure 3.2i is a close second for me. How were the 
ideas you came up with similar or different from mine? After reading my explanation, 
does your decision about how you'd communicate this data stay the same? 

As with most of what we've been discussing, there is no single right answer. Different 
people will make different choices. Of utmost importance is that you make 
it easy for your audience. When you show text and data together, make it clear to 
your audience where they should look in the data for evidence of what's being said 
when they read the text, and where they should look in the text for additional 
detail when they look at the data. The Gestalt principles can help you achieve this. 


Exercise 3.3: harness alignment & white space 

We've looked at Gestalt principles for organizing what we see. We should also 
eliminate visual clutter. When elements aren't aligned and white space is lacking, 
things feel cluttered. The concept is similar to cleaning up a messy room: put 
everything in its place and magic happens—the same items are present but now 
there is a harmonious sense of order. 

Let's do a quick exercise to illustrate how we can undertake a similar endeavor 
with our graphs. These seemingly minor components of our visual designs can 
have major impacts on the overall look and feel of what we create as well as the 
perceived ease with which our audience can consume it. 

See Figure 3.3a, which is a slide showing data about physicians writing prescriptions 
(writers) for a pharmaceutical drug (Product X) across three different promotions 
(A, B, and C). The slopegraph compares the percent of total across the three 
promotion types between repeat writers (those physicians who have prescribed 
Product X before) at the left and new writers (those prescribing it for the first time) 
at the right. The details in this case aren't so important. It's also worth noting 
that we could debate whether this is the best way to show this data, but 
let's not worry about that. Rather, let's focus on how we might better arrange the 
current components. 

What changes would you make when it comes to alignment and white space to 
improve this visual? Are there other changes you would suggest? Write them down. 

If you'd like, you can download this visual and implement the changes you've outlined. 

[Figure 3.3a slide text, as far as recoverable from the original graphic: 
Takeaways: "There were 45K new writers in the past year. The distribution across 
promo types looks different than repeat writers." 
Graph title: "Product X writers by promo type" (% OF TOTAL) 
Labels: PROMO A, PROMO B, New writers; data labels include 59% and 13% 
Side text (fragmentary): "Though Promo A makes up the biggest segment [...] to new 
writers than to repeat writers." "Both Promo B and Promo C [...] of new writers 
compared to [...]" "How should we use this [...] strategy?"] 

FIGURE 3.3a How might we better use alignment & white space? 


Solution 3.3: harness alignment & white space 

The visual in Figure 3.3a feels sloppy. It appears as if elements were simply thrown 
onto the slide. A couple of minutes and quick changes to alignment and use of white 
space can bring a sense of order and make the information easier to assimilate. 

First, let's discuss alignment. Currently, the text on the slide is all center-aligned.
I tend to avoid center alignment because it can leave things hanging in space.
Additionally, when the text flows onto multiple lines, it creates jagged edges that
look messy. I'm an advocate of left- or right-aligning text boxes to create clean
vertical and horizontal lines across the elements. Doing so allows us to make use
of the Gestalt principle of closure: as we create framing, it helps tie the elements
of the slide together. In this case, I'll left-align the takeaways at the top, the graph
title, and the left x-axis labels (Repeat writers, 92K). I'll pull the data labels within
the graph to be labeled at the left for Repeat writers and at the right for New
writers. I'll also pull the Promo A, Promo B, Promo C descriptions out of the middle of
the graph and orient those on the right, lining them up horizontally with the data 
labels for those points (I could have also put these to the left of the left labels— 
which I choose typically depends where I want my audience to focus). Finally, I 
left-aligned the text at the right. 

I left-justified most of the text (the only exception is the New writers and 45K x-axis
label, which I've right-justified to provide framing on the right side of the graph).
Whether you choose left or right alignment (or in rare instances, center
alignment) depends on the layout of the elements on the rest of the page. The
idea is to create clear vertical and horizontal lines. Sometimes right-justified text 
will also work well, and you'll see this in a number of examples throughout this 
book. I did try right-aligning the text at the right side of the page, but it created 
some jagged trapped white space in the middle of the page that I didn't like, so 
I reverted to left alignment. 

The other change I made was in regard to white space. I pulled the graph title
up so there would be a little space between it and the graph. I reduced the width 
of the graph both to allow room to label the various data series at the right as 
well as to have some space between that and the text box at the right. Probably 
the biggest (and fastest to implement) change in this area was simply adding line 
breaks to the text on the right, making it easier to scan and a little nicer to view. 
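
The effect of line breaks and left alignment can be previewed in plain text before touching the slide. Here is a minimal sketch using Python's standard `textwrap` module. The 32-character width and the helper name `left_aligned_block` are arbitrary choices for illustration, not measurements taken from the slide.

```python
import textwrap

def left_aligned_block(text, width=32):
    """Break text into lines of at most `width` characters, flush left."""
    return textwrap.wrap(text, width=width)

takeaway = ("Though Promo A makes up the biggest segment overall, "
            "it contributes less to new writers than to repeat writers.")

lines = left_aligned_block(takeaway)

# Left-aligned: every line starts at the same clean vertical edge.
for line in lines:
    print(line)

# Centered: the same lines now have ragged edges on both sides.
for line in lines:
    print(line.center(32))
```

Comparing the two printed blocks gives a rough feel for why the left-aligned version is easier to scan.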



You can see all of these changes implemented in Figure 3.3b. 

[Slide, text left-aligned. Takeaways at top: "There were 45K new writers in the past year. The distribution across promo types looks different than repeat writers." Slopegraph: "Product X writers by promo type," % of total, with Repeat writers (92K) at left and New writers (45K) at right; the right-hand labels include 33% for Promo B and 22% for Promo C. Text at right, with line breaks: "Though Promo A makes up the biggest segment overall, it contributes less to new writers than to repeat writers. Both Promo B and Promo C brought in higher proportions of new writers compared to repeat writers. How should we use this data for our future promotion strategy?"]

FIGURE 3.3b Better employment of alignment & white space

Compare Figure 3.3b to Figure 3.3a. How does this ordered graph feel compared to 
the original? I appreciate the sense of structure in Figure 3.3b that was initially lacking. 


You may have made some different decisions in your redesign, and that's okay. The 
main point is to be thoughtful in your use of alignment and white space. These 
small things can have a big impact! 


We'll look at more examples illustrating the benefit of paying attention to detail in
our visual design in Chapter 5. 

Exercise 3.4: declutter!

A frequent source of clutter in data visualization comes from unnecessary graph 
elements: borders, gridlines, data markers, and the like. These can make our visuals 
appear overly complicated and increase the work our audience has to undertake in 
order to understand what they are viewing. As we eliminate the things that don't 
need to be there, our data stands out more. Let's take a closer look at the benefit 
that decluttering can have on our data visualizations. 

See Figure 3.4a, which shows time to close deals, measured in days, for direct and 
indirect sales teams over time. 

What visual elements could you eliminate? What other changes would you make to 
what is shown or how it's shown to reduce cognitive burden? Spend a moment considering
this and make some notes. How many changes would you make to this visual?


[Bar chart: "Time to Close Deal," subtitle "Goal = 90 days," y-axis labeled from 0.0 to 140.0 in steps of 20, diagonal date labels on the x-axis, and a legend for Direct Sales and Indirect Sales.]

FIGURE 3.4a Let's declutter!

Solution 3.4: declutter!

I identified 15 changes I'd like to make to the Time to Close Deal graph. If your list
falls short of this, refer back to Figure 3.4a and take another minute or two to see 
what else you might modify before you read through my ideas. 

Ready? Let me walk you through what I would do—step by step—and my thought 
process behind each choice. 

1. Remove heavy lines. The heavy horizontal lines between the title and graph and
at the bottom border of the graph are unnecessary. The closure principle tells us we already
see the graph as part of a whole—we don't need these partial enclosures to make 
that clear. Use white space instead to set the title and graph apart from other 
elements as needed. 


[Graph: Time to Close Deal, Goal = 90 days; legend: Direct Sales, Indirect Sales]

FIGURE 3.4b Remove heavy lines

2. Remove gridlines. Gridlines are gratuitous! It's amazing to me how much the 
simple steps of removing the chart border and gridlines do to make our data 
stand out more. See Figure 3.4c. 
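
In a scripted plotting tool, these first two steps usually come down to a handful of settings. As a rough sketch, assuming matplotlib is the tool (the graph in this exercise may well have been built in something else), the dictionary below collects the relevant `rcParams` keys. The name `DECLUTTER_RC` and the helper `hidden_elements` are made up for this example, and keeping the left and bottom spines is my own judgment call.

```python
# Standard matplotlib rcParams keys that control borders and gridlines.
# Assumption: we keep the left and bottom axis lines and drop the rest.
DECLUTTER_RC = {
    "axes.grid": False,           # no gridlines
    "axes.spines.top": False,     # no top border
    "axes.spines.right": False,   # no right border
    "axes.spines.left": True,     # keep the y-axis line
    "axes.spines.bottom": True,   # keep the x-axis line
    "legend.frameon": False,      # no box around the legend
}

def hidden_elements(settings):
    """Return the setting names that are switched off."""
    return [key for key, shown in settings.items() if not shown]

print(hidden_elements(DECLUTTER_RC))
```

If matplotlib is installed, a dictionary like this could be applied by wrapping the plotting code in `matplotlib.rc_context(DECLUTTER_RC)`.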

[Graph; legend: Direct Sales, Indirect Sales]

FIGURE 3.4c Remove gridlines


3. Drop trailing zeros from y-axis labels. This is one of my pet peeves! The zeros 
after the decimal place carry no information—get rid of them. I'm also going to 
change the frequency of my y-axis labels. While labeling every 20 makes sense 
given the scale of the numbers, we're plotting days, so something like every 30 
would make more sense (approximating a month). I started with that, but the axis 
looked overly sparse, so I opted for every 15 days. Let's also add a y-axis title so 
we know what we're looking at. I advocate titling axes directly so that your audience
isn't left questioning or having to make assumptions to decipher the data.
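
The tick choices in this step are easy to script. The following is a small sketch, not tied to any particular charting tool: `day_ticks` and `tick_label` are hypothetical helper names, and the axis maximum of 135 days is an assumption used only to show the output.

```python
import math

def day_ticks(max_value, step=15):
    """Tick positions from 0 up to the first multiple of `step`
    at or above max_value."""
    top = math.ceil(max_value / step) * step
    return list(range(0, top + step, step))

def tick_label(value):
    """Format a tick without trailing zeros, so 120.0 becomes '120'."""
    return f"{value:g}"

ticks = day_ticks(135)
print([tick_label(t) for t in ticks])
```

Changing `step` to 30 reproduces the sparser monthly spacing I tried first.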


[Graph: Time to Close Deal, Goal = 90 days; y-axis title added and y-axis relabeled every 15 days]

FIGURE 3.4d Drop trailing zeros from y-axis labels

4. Eliminate diagonal text on x-axis. Diagonal text looks messy. Worse than that,
studies have shown it's slower to read than horizontal text (vertical text, by the
way, is even slower). If efficiency of information transfer is one of your goals when
communicating with data (which I'd argue it should be), aim for horizontal text
whenever possible.

I see this issue often with diagonal x-axis labels, where the year is repeated with
every date. That is redundant, and the extra length frequently forces the dates
to be set diagonally. We can avoid this by using month abbreviations for the
primary x-axis labels and then either using the year as a supercategory x-axis label
or titling the axis with the year. In this case, I've simply titled the x-axis with the
year (2019) to make the date range clear.
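
Here is one way the relabeling could be scripted. It is a minimal sketch that assumes the dates are available as Python `date` objects and all fall in a single year; the function name `month_axis` is invented for illustration.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def month_axis(dates):
    """Return (tick labels, axis title) for dates within one calendar year."""
    years = {d.year for d in dates}
    if len(years) != 1:
        raise ValueError("this sketch only handles a single calendar year")
    labels = [MONTHS[d.month - 1] for d in dates]
    return labels, str(years.pop())

dates = [date(2019, m, 1) for m in range(1, 13)]
labels, title = month_axis(dates)
print(labels)   # short labels that fit horizontally
print(title)    # the year, stated once as the axis title
```

For data that spans more than one year, the supercategory approach would be the natural extension: group the month labels under each year rather than repeating it.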

[Graph; legend: Direct Sales, Indirect Sales]

FIGURE 3.4e Eliminate diagonal text on x-axis

5. Thicken the bars! Another pet peeve is when the white space between the bars 
is bigger than the bars themselves. Let's thicken those. This also makes use of the 
connection Gestalt principle, where with reduced distance between the bars, my 
eyes start to try to draw lines between the bars (if moving from bars to lines was on 
your list of recommendations, don't worry, it's on mine, too—we'll get there soon!). 
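
The bar-versus-gap relationship is simple arithmetic. As a sketch, assume each month occupies one unit along the x-axis and holds two bars side by side, with some share of that unit left empty between groups. The function name `bar_width` and the 20% gap are illustrative choices, not settings from the graph.

```python
def bar_width(n_series, gap_share=0.2):
    """Width of each bar when one category slot is 1.0 unit wide and
    `gap_share` of the slot is left empty between groups."""
    if not 0 <= gap_share < 1:
        raise ValueError("gap_share must be at least 0 and below 1")
    return (1.0 - gap_share) / n_series

gap = 0.2
width = bar_width(n_series=2, gap_share=gap)
print(width, gap, width > gap)
```

With these numbers each bar is twice as wide as the gap between groups, which avoids the pet peeve described above.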

[Graph: Time to Close Deal, Goal = 90 days; x-axis titled 2019; legend: Direct Sales, Indirect Sales]

FIGURE 3.4f Thicken the bars!

6. Pull data labels into ends of bars. Now that we've thickened the bars, we have 
space to pull the data labels into the bars. This is a cognitive trick. Looking back 
to the previous iteration (Figure 3.4f), when we had a data label outside the end 
of each bar, each bar and label acted as two discrete elements. Now that we've
thickened the bars, we have room to embed the data labels, taking what were
previously two distinct elements and turning them into a single element. This reduces
the perceived cognitive burden, without reducing any of the actual data that
we are showing.

Previously, we had a decimal of significance on each data label. This will always 
be context-dependent, but in this case given the scale of the numbers, we don't 
need that specificity (also be cautious of the aforementioned false precision that 
having too many decimal places can convey). Here, we have an added benefit: 
reducing precision allows us to more cleanly fit the labels into the ends of the 
bars. I've made the labels white (versus the previous black), simply because I like 
the contrast you get with white-on-color. See Figure 3.4g. 
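
Rounding labels sounds trivial, but it is worth a quick check in code. The sketch below uses Python's `decimal` module so that halves round up, which is what most readers expect; the built-in `round` sends 90.5 to 90. The raw values are hypothetical and the helper name `whole_day_label` is my own.

```python
from decimal import Decimal, ROUND_HALF_UP

def whole_day_label(value):
    """Round to the nearest whole day, with halves rounded up."""
    rounded = Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return str(rounded)

# Hypothetical one-decimal values, not the book's underlying data.
for raw in (73.6, 46.2, 90.5):
    print(raw, "->", whole_day_label(raw))
```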


[Graph: Time to Close Deal, Goal = 90 days; y-axis labeled every 15 days from 0 to 135; data labels inside the ends of the bars; legend: Direct Sales, Indirect Sales]

FIGURE 3.4g Pull data labels into ends of bars

7. Eliminate data labels. In the previous step, I rounded and pulled the data labels
into the ends of the bars. Note, though, that we don't need both the y-axis
and every data point labeled—this is redundant. This is a common decision point 
when it comes to visualizing data: do I preserve the axis, label the data directly, 
or some combination of the two? The main thing you want to consider when 
making this decision is the degree of importance of the specific numeric values. If 
it's critical that your audience know that time to close a deal for Direct Sales went 
from exactly 74 days in November to exactly 46 days in December, then you could 
label the data directly and get rid of the y-axis altogether. If, on the other hand, 
you want your audience to focus on the shape of the data, or general trends or
relationships, then I'd recommend preserving the axis and not cluttering the graph
with the data labels.

In this case, I'll assume the shape and general trends are more important than the 
precise numeric values. Because of this, I'm going to keep the y-axis and remove 
the data labels from each bar. 

[Graph; legend: Direct Sales, Indirect Sales]

FIGURE 3.4h Eliminate data labels

8. Make it a line graph. If you've been thinking this entire time: "This is data 
over time, so shouldn't it be a line graph?" I'm with you. Check out the impact of 
moving from bars to lines. We're using less ink, so the overall design feels cleaner. 
This is also a big win from a cognitive burden standpoint: we've taken what was 
previously twenty-four bars and replaced them with just two lines. 


[Line graph: Time to Close Deal, Goal = 90 days]

FIGURE 3.4i Make it a line graph

9. Label the data directly. Look back to Figure 3.4i and locate the legend. Your
eyes do a bit of scanning to find it, right? This is work. It's perhaps more noticeable
now that we've taken a number of steps to reduce the other issues that
were causing cognitive burden. This is the kind of work that we want to identify 
and take upon ourselves—as the designers of the information—so our audience 
doesn't have to exert effort to figure out what they are seeing. 

We can use the Gestalt principle of proximity to put the data labels right next to the 
data they describe. This eliminates any searching to figure out how to read the data. 


[Line graph: Time to Close Deal, Goal = 90 days; x-axis titled 2019; series labeled directly]

FIGURE 3.4j Label the data directly


10. Make data labels the same color as the data. While we leverage proximity— 
putting the data labels right next to the data they describe—let's also make use of 
similarity, recoloring the data labels to match the data they describe. It's another visual
cue to our audience to indicate that these things are related. 
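
Steps 9 and 10 can be combined into one small routine: put each series name next to its line and give it the line's color. The sketch below assumes the label goes at the last data point, which is one common placement rather than the only one. The values and hex colors are made up, and `end_labels` is a hypothetical helper.

```python
def end_labels(series, colors):
    """For each series, describe a text label placed at its last data
    point and drawn in the same color as the series."""
    labels = []
    for name, values in series.items():
        labels.append({
            "text": name,
            "x": len(values) - 1,   # index of the last point
            "y": values[-1],        # height of the last point
            "color": colors[name],  # similarity: label matches its line
        })
    return labels

series = {"Direct": [83, 74, 46], "Indirect": [60, 75, 80]}   # made-up values
colors = {"Direct": "#1f4e79", "Indirect": "#2e86c1"}         # made-up colors

for label in end_labels(series, colors):
    print(label)
```

Each dictionary holds what a plotting library's text call would typically need: position, text, and color.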

[Line graph: Time to Close Deal, Goal = 90 days; x-axis titled 2019]

FIGURE 3.4k Make data labels the same color as the data


11. Upper-left-most orient graph title. Without other visual cues, your audience 
will start at the top left of your page, screen, or graph, and do zigzagging "z's" 
to take in the information. Because of this, I advocate upper-left-most justifying 
graph and axis titles and labels. This means your audience sees how to read the 
data before they get to the data itself. As we discussed in Exercise 3.3, I tend 
to avoid center alignment of text because it leaves things hanging out in space 
(revisit Figure 3.4k and look at the graph title positioning). When you have text 
that goes onto multiple lines, this creates jagged edges that look messy. While I 
was changing the positioning of the title, I also eliminated the italics, which were 
unnecessary. 

[Line graph: Time to Close Deal, Goal = 90 days; title at upper left]

FIGURE 3.4l Upper-left-most orient graph title

12. Remove title color. Up until now, the graph title has been blue. Did you find
yourself trying to tie that to the Indirect trend in the data? That's the Gestalt principle
of similarity at play: we naturally try to associate things of similar color. In this
case, that is a false association. Let's eliminate it by taking color out of the title 
entirely (we'll look at an alternative approach where we do use some color in the 
title to make use of this natural association momentarily). 

[Line graph: Time to close deal, Goal = 90 days]

FIGURE 3.4m Remove title color


13. Put the goal in the graph. Originally, the subtitle told us that the goal for time 
to close a deal is 90 days. If we want to be able to relate this to the data (are we 
above goal? below goal?), which would make sense here, let's put that information
in the graph directly so we can visually compare the data and don't have to
really think about it.
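
The comparison the goal line makes visible is the same one a short script can make explicit. Here is a minimal sketch: the 90-day goal comes from the graph's subtitle, while the monthly numbers and the function name `months_over_goal` are invented for illustration.

```python
GOAL_DAYS = 90  # from the graph subtitle

def months_over_goal(values_by_month, goal=GOAL_DAYS):
    """Return the months where time to close exceeded the goal."""
    return [month for month, days in values_by_month.items() if days > goal]

# Illustrative numbers only; the underlying data is not reproduced here.
indirect = {"Jan": 82, "Feb": 95, "Mar": 71, "Apr": 104, "May": 88, "Jun": 93}

print(months_over_goal(indirect))
```

A list like this is also a handy starting point for deciding which data points to highlight later.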

[Line graph with Goal line; x-axis: Jan through Dec, 2019]

FIGURE 3.4n Put the goal in the graph

14. Iterate to best visualize Goal. The Goal line is quite pronounced in the preceding
graph. Let's take a look at some different views. This is a good example
of how allowing yourself time to iterate and look at things a number of different
ways can be useful at many points during the process. I like using dotted lines for
goals or targets, but when the line is thick, this introduces visual clutter. When I
make the line thinner, it deemphasizes it, so it's there and easy to see but isn't
drawing undue attention. I also like all caps for short phrases, such as GOAL,
because they are easy to scan and create nice rectangular shapes (versus mixed
case, which doesn't have clean lines at the top since letters such as l are taller than
letters such as a).

[Four panels, "Iterating on goal line": SOLID; DASHED; THIN & DASHED; THIN, DASHED & CAPITALIZED]

FIGURE 3.4o Iterating: different formatting for Goal line

Let's look at a bigger version of that final iteration.


[Line graph: Time to close deal]

FIGURE 3.4p My favorite of the Goal line iterations


15. Remove color. Given that we have sufficient spatial separation between the 
lines in this graph, we don't need to use color as a categorical differentiator the 
way we have up to this point. I'll make everything grey. Color will come back into 
play when we focus attention—let's do that next. 

[Line graph, all elements in grey]

FIGURE 3.4q Remove color

Focus attention. We're jumping ahead, but now that we've come this far it only
feels right that we carry this one through to the end. I'll stop numbering my steps 
at this point, since I'm no longer decluttering. In the preceding view, I pushed 
everything to the background—made it all grey. This forces me to be thoughtful 
about where and how I direct my audience's attention. There are a number of 
things we could point out in this data. Let's assume for a moment that we want to 
draw attention to the Indirect data series in this graph. 

Figure 3.4r illustrates one way I could achieve that. 


[Line graph: "Time to close deal: indirect varies over time"; x-axis titled 2019]

FIGURE 3.4r Focus attention


Notice how my words above the graph are tied visually to the Indirect trend within 
the graph through similar use of color. This is akin to how we perhaps tried to tie 
the original blue title in the initial graph to the blue trend, only this time I'm doing 
it on purpose and it makes sense. By reading those words at the top of the graph, 
my audience will already know what to look for in the data before they get there. 
Also, if we think of this from a scannability standpoint, if I only look at this for a 
couple of seconds, the colored words and line draw my attention and I clearly and 
quickly get the takeaway that time to close indirect deals varies. 
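
The "grey everything, then color one thing" approach translates directly into a color mapping. This is a sketch under my own assumptions: the hex values are placeholders rather than the colors used in the figure, and `focus_colors` is a hypothetical helper name.

```python
GREY = "#999999"  # placeholder background color

def focus_colors(series_names, focus, highlight="#1f77b4"):
    """Map every series to grey except the one we want noticed."""
    if focus not in series_names:
        raise ValueError(f"unknown series: {focus}")
    return {name: (highlight if name == focus else GREY)
            for name in series_names}

palette = focus_colors(["Direct", "Indirect", "Goal"], focus="Indirect")
print(palette)
```

Using the same `highlight` value for the matching words in the title would give the visual tie between text and data described above.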

Focus attention elsewhere. I can use this same strategy and some highlighted data
points to make a different point. See Figure 3.4s. 


[Line graph: "Time to close deal: indirect sales missed goal 3 times"]

FIGURE 3.4s Focus attention elsewhere


Introduce a bit more color to really direct attention. I could take the preceding example
a step further by introducing another color. I tend to avoid green for "good"
and red for "bad," because of the inaccessibility for colorblind audience members.
Bright orange can have a similar negative connotation to red and stands out quite
nicely given the other colors we have in play (this particular shade of blue was chosen
because it matched the client's branding).


[Line graph: "Time to close deal: indirect sales missed goal 3 times"]

FIGURE 3.4t Introduce a bit more color to really direct attention

Focus attention on yet another takeaway. If the times where we've missed goal
aren't the crux of the point we want to make, we could direct attention to a
meta-takeaway: we're beating our goal most of the time across both Indirect and
Direct sales. We can use words and color to make this point clear. In this view, 
with the end markers and data labels, one obvious comparison for our audience 
to make is how Indirect and Direct time to close deals compared to each other 
and the goal in December. 

[Line graph: "Time to fill: beating goal the majority of time"]

FIGURE 3.4u Focus attention on yet another takeaway

We'll look at more strategies for focusing attention in Chapter 4. 

Next, let's shift gears and have you undertake some practice on your own. 

Exercise 3.5: which Gestalt principles are in play?

As we've discussed and seen illustrated through a number of examples so far, 
Gestalt principles help us organize what we see, providing cues to what clutter we 
can eliminate and relating elements to each other in various ways. Consider the 
six principles we've covered—proximity, similarity, enclosure, closure, continuity, 
and connection—and review the following visual. 

Which Gestalt principles are being used in Figure 3.5? Where and how? What 
effect does each achieve? 


[Graph: Wallet share by growth type]

FIGURE 3.5 Which Gestalt principles are in play?

Exercise 3.6: find an effective visual

Find an example graph that you believe is effective. This can be from your work, 
someone else's work, the media, storytellingwithdata.com, or other sources. Are 
any Gestalt principles being used? I'll bet there are. Which and how? List them! 
What do the Gestalt principles you identify help achieve? What else do you like 
about the graph? What makes it effective? 

Write a paragraph or two outlining your evaluation. See the following for reference. 

Exercise 3.7: alignment & white space

Alignment and white space: these are elements of the visual design that, when
done well, we don't even notice. However, when they aren't done well, we feel it
in the resulting visual. It may seem disorganized or connote lack of attention to 
detail, distracting from our data and message. 

Consider Figure 3.7, which shows consumer sentiment based on survey results 
about various potential beverage line extensions for a food manufacturing company.
Complete the following steps.

STEP 1: Reflect on what specific changes you would recommend when it comes 
to the effective use of alignment and white space. List them. 

STEP 2: Think back to the other lessons covered in this chapter (making use of 
Gestalt principles, decluttering, employing contrast). What other steps would you 
take to declutter or otherwise improve this visual? 

STEP 3: Download the data and make the changes you've recommended to the 
existing graph, or import the data into your preferred tool and create a clutter-free 
visual that aligns elements and makes use of white space. 



FIGURE 3.7 How can white space & alignment improve this visual? 

Exercise 3.8: declutter!

As we've seen through a number of the examples we've looked at already, there 
can be great value in identifying and removing things in our visuals that don't 
need to be there. Each element we take away helps our data stand out more 
and frees up space so we can add things that matter. Let's continue to practice 
this important skill of identifying clutter and removing it from our visual designs. 
We'll do a few of these so you can have the opportunity to identify a number of 
different types of clutter. 

Consider Figure 3.8, which shows customer satisfaction score over time. What 
unnecessary visual elements could you eliminate? What other changes would 
you make to reduce cognitive burden? Make some notes about the changes you 
would make to declutter this graph. 

To take it a step further, download the data and make the changes you've recommended
to the existing graph, or import the data into your preferred tool and
create a visual that is clutter-free. 


[Graph: Customer Satisfaction Score; x-axis labeled monthly, May '18 through Dec '19]

FIGURE 3.8 Let's declutter!



Exercise 3.9: declutter (again!) 

Clutter takes a lot of different forms. Let's look at another example graph that can 
be improved. 

Check out Figure 3.9, which shows the monthly number of cars sold by a national
chain of dealerships. What unnecessary visual elements could you eliminate?
What other changes would you make to reduce cognitive burden? Make some 
notes about the changes you would make to declutter this graph. 

To take it a step further, download the data and make the changes you've recommended
to the existing graph, or import the data into your preferred tool and
create a visual that is clutter-free. 



FIGURE 3.9 Let's declutter! 

Exercise 3.10: declutter (some more!)

Here is another opportunity to identify and eliminate clutter. See Figure 3.10,
which shows the percent of customers of a given bank having automated payments
by different products. What unnecessary visual elements could you eliminate?
What other changes would you make to reduce cognitive burden? Make
some notes about the changes you would make to declutter this graph. 

To take it a step further, download the data and make the changes you've recommended
to the existing graph, or import the data into your preferred tool and
create a visual that is clutter-free. 



FIGURE 3.10 Let's declutter! 

Exercise 3.11: start with a blank piece of paper

Often, the culprit behind the clutter in our visuals is our tool. When we draw, each
stroke of the pen or pencil takes effort. We don't tend to put forth that effort unless
it's worth it, which means it's harder for non-information-carrying elements to
work their way into our designs. 

In the same way we used drawing to brainstorm and iterate through different 
views of our data in Chapter 2, there are benefits to drawing from a decluttering 
standpoint, too. 

Consider a project where you need to communicate with data. Spend some time 
getting familiar with the data and what you want to communicate. Get yourself a 
blank piece of paper and sketch your visual. Consider whether you've included 
anything that's not necessary. Once you have it right on paper, determine what 
tools or experts you have at your disposal to make your ideas real. 

Exercise 3.12: do you need that?

Once we've taken the time to put something together, it can become difficult to 
look at it with fresh eyes and determine what we should eliminate. After creating 
a visual, pause and ask yourself the following questions. Or take an existing visual 
from a regular report or dashboard and assess how it could be improved through 
decluttering. 


• What visual clutter can you eliminate? Are there any unnecessary elements 
distracting from your data or message? You can usually get rid of borders and 
gridlines. Is anything unnecessarily complicated? How might you simplify? 
What feels like work? How can you remedy? What other changes can you 
make to reduce cognitive burden? 

• Is there redundant information you could streamline? It's important to clearly 
title and label everything, but look for redundancy that can be eliminated. For 
example, decide whether an axis or data labels will best meet your needs—you 
typically won't need both. Units should be clearly displayed, but may not need 
to be attached to every data point. Use effective titling to help streamline. 

• Is all of the data you are showing necessary? Go through each piece of data in 
your graph or presentation and ask yourself whether you need it. If you plan to 
remove any data, consider what context you lose with it. In some cases, this still 
makes sense. As part of this, think about: what is the right time frame to show? 
What are the important comparison points? Are they all equally important? Reflect
on what aggregation or frequency makes sense; sometimes rolling daily
data into weekly or monthly data into quarterly (for example) can simplify and 
make it easier to see overarching trends. 

• What could you push to the background? Not all elements in a chart or 
on a page are equally important. Where can you make use of grey to push 
non-message-impacting components to the background and employ strategic
contrast to help direct attention?

• Seek feedback. Recruit a colleague to look at your visual and ask probing 
questions that will force you to talk through what you are showing. If you ever 
find yourself saying things like, "Ignore this" or they ask you questions about 
points that you thought were clear, these are verbal cues that you can use to 
further refine your visual, pushing less important elements to the background 
or getting rid of them entirely. Make changes and then go through the process
with someone else. Iterate based on feedback to help push your work
from good to great. 
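
The aggregation idea in the third bullet above (rolling daily data into weekly) can be sketched in a few lines. This is only an illustration under simple assumptions: the daily values are invented, weeks are taken as consecutive runs of seven values rather than calendar weeks, and `weekly_means` is a hypothetical helper.

```python
from statistics import mean

def weekly_means(daily_values, days_per_week=7):
    """Average consecutive runs of daily values. A short final week is
    averaged over the days it has."""
    return [
        mean(daily_values[i:i + days_per_week])
        for i in range(0, len(daily_values), days_per_week)
    ]

daily = [10, 12, 11, 13, 12, 14, 12,   # week 1
         20, 22, 21, 23, 22, 24, 22,   # week 2
         30, 30]                       # partial week 3

print(weekly_means(daily))
```

Sixteen noisy daily points become three weekly ones, which makes the overall upward trend much easier to see. Whether the lost day-to-day detail matters depends on the context you need to keep.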


Exercise 3.13: let's discuss

Consider the following questions related to Chapter 3 lessons and exercises. Discuss
with a partner or group.

1. Why is it important to identify and eliminate clutter? What are common types 
of clutter that you will remove from your visual communications going forward?
When does it not make sense to spend time decluttering?

2. Review the Gestalt principles. Which of these would you like to use more 
in your work? How will you do so? Are there any that don't make sense or 
where you are unclear how you could use them in your work? 

3. Is there any common clutter that your graphing application routinely adds 
to your visuals? How can you streamline your process of decluttering to be 
more efficient in your tools? 

4. We've looked at some examples where data over time is plotted as bars. 
What are the benefits from a clutter standpoint of showing this data as a line 
graph? When would it make sense to do this? In what scenarios would you 
keep the bars? 

5. What is one tip you picked up from the lessons in this chapter that you plan 
to employ going forward? Where and how will you make use of it? Can you 
foresee exceptions where you wouldn't put into practice the given strategy? 

6. Lining up elements, preserving white space, and employing strategic contrast:
are these just about making things pretty, or is there more to it than
that? Does this sort of attention to detail matter? Why or why not? 

7. Can you imagine any situations where clutter is desirable? When and why? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback? 

Where do you want your audience to look? It's a simple question, yet one we frequently
don't give much thought to when we are creating graphs and the pages
that contain them. We can take intentional steps in our visuals to make it clear to our
audience where they should pay attention and in what general order. This can be
achieved by using preattentive attributes (such as color, size, and position) strategically.
Not everyone sees the same thing when they look at data, but by taking
thoughtful design steps, you can help your audience focus on the right things.

Let's practice focusing attention! 

First, we'll review the main lessons from SWD Chapter 4. 

Exercise 4.1: where are your eyes drawn?

I frequently employ a simple strategy to figure out whether I'm directing my audience's
attention effectively: the "Where are your eyes drawn?" test. It's easy
to do: create your graph or slide, then close your eyes or look away. Look back
at it, taking note of the point where your eyes go first. This is probably the place your
audience's eyes will land as well. You can use this method to test whether you are 
directing attention in the right way and make adjustments as needed. 

Let's practice this test with a few images and discuss the implications for when we 
are communicating with data. 

For each of the following, close your eyes for a moment, then open them and 
look at the picture, paying attention to where your eyes go first. Why is this, do 
you think? What learnings from this exercise can you generalize and apply to your 
data visualizations? Write a couple of sentences or a short paragraph for each.



FIGURE 4.1a Where are your eyes drawn?

FIGURE 4.1b Where are your eyes drawn?

FIGURE 4.1c Where are your eyes drawn?

FIGURE 4.1d Where are your eyes drawn?

FIGURE 4.1e Where are your eyes drawn?

Solution 4.1: where are your eyes drawn?

I have a lot of fun with this test. It's interesting to see what things in our environment
pull at our attention, and how we can generalize learnings from these observations.
The following recounts where my eyes landed first in the preceding images
and a few ideas for applying similar aspects when we communicate with data.

FIGURE 4.1a: My eyes immediately go to the speed limit sign on the right. This
is for a number of reasons. The face of the sign is large compared to the rest of
the elements in the picture. The big, bold, black number on white is striking. The
red on the sign demands attention, both because it's very different from the background
and because we are conditioned over time that red is often an alert to
which we should pay attention. For someone who is red-green colorblind, however,
red likely won't have the same effect. That's one of the reasons it can be useful
to have some redundancy of signals to direct attention and make sure everyone in 
your audience can fully see what you are showing. Finally, there's a white outline 
at the edge of the sign that sets it apart from the background. 

Let's consider how we can apply these elements to our data visualizations and the 
pages that contain them. Size, typeface, color, enclosure: these are all elements 
that, when used sparingly, indicate relative importance to our audience, signaling
to them where to look.

FIGURE 4.1b: My eyes go to the sun, then to the car, and then back to the sun
again. When I focus on the sun, I can see the car in my peripheral vision. If I shift my 
focus to the car, I can still see the bright sun out of the corner of my eye. In applying 
learnings to data visualization, we should be aware of the tension that is introduced 
when we emphasize multiple things simultaneously in a graph or on a slide. 

FIGURE 4.1c: My eyes landed first on the Queens Bronx sign. This happened
for a few reasons. It's crisp compared to some blurred elements in the photo.
The sun is shining on this sign in a way that highlights it. It's larger than the other
signs. Because of the large size and relatively fewer words, there is also more
blank space, which really makes it stand out against the busy background. It appears
first in the arrangement of the signs, so I find that my eyes go there initially,
then continue rightward. Also take note of the various preattentive attributes used to
make things stand out in different ways on the signs themselves: bold, all caps,
arrows, color (yellow). Speaking of color, the Exit Only portion of the Staten Island 
sign is also attention grabbing. There's a lot going on in this picture, which can 
complicate the process of getting everyone to look at the same thing first. Still, 
there are things to be learned from this. 

How can we apply similar aspects when we visualize data? Keep key elements 
crisp and legible. Highlight strategically to make one thing in a row of similar 
things stand out. Make more important things bigger (and as a corollary: size
things of similar importance consistently). Be conscious of how we organize elements
on a page and try to do so in a way that draws our audience's eyes the way we
would like them to move.

FIGURE 4.1d: My eyes go immediately to the yellow car. Do me a favor and flip
back to Figure 4.1d and do the exercise again. Notice both where your eyes go
first and where they go next. Mine land on the car, then follow the road downward 
to the left. Other people may look at the car, then continue along the curvy road 
upward to the right. We didn't spend much time looking at the trees in the upper 
left or bottom right. 

When we think about our graphs and slides, we want to be aware of how we are 
directing attention, either intentionally or inadvertently. Make sure you aren't accidentally
directing your audience's attention away from something at which you
want them to look. 

FIGURE 4.1e: My eyes had trouble landing on anything in this colorful collection
of cars. They bounced around from blue to yellow to red. Colorful is a good
goal for a car dealership that wants to have the right color auto for everyone, but
it's not a great goal when visualizing data. By making so many things different,
we actually lose the potential strategic preattentive value of color. With so many
shades, it's difficult to create sufficient contrast to focus our audience's eyes. Color
used sparingly is one of the most effective ways to direct our audience's attention
to where we want them to look. Check out Figure 4.1f for evidence.



FIGURE 4.1f Where are your eyes drawn?

Exercise 4.2: focus on...

Let's continue to explore how we can focus attention and apply some of the learnings
from Exercise 4.1 to graphs. When we visualize data, there are often various
takeaways that we could highlight. It can sometimes be useful to show the same 
graph multiple times, each instance highlighting the point or points we want to 
focus on as we walk our audience through different nuances of the data. This 
allows them to know exactly where to look in the data as we are talking, or when 
they are reading corresponding text. Let's practice how we can achieve this with 
a specific example. 


See the following visual, which shows year-over-year change ("YoY," measured as
percent change in dollar sales volume) for cat food brands from a pet food manufacturer.
Answer the following questions, download the data, and employ the
strategies you outline in your tool.


Cat food brands: YoY sales change
% CHANGE IN VOLUME ($)
Horizontal axis: -20% to 20% in 5% steps (DECREASED | INCREASED)
Brands, top to bottom: Fran's Recipe, Wholesome Goodness, Lifestyle, Coat Protection,
Diet Lifestyle, Feline Basics, Lifestyle Plus, Feline Freedom, Feline Gold, Feline Platinum,
Feline Instinct, Feline Pro, Farm Fresh Tasties, Feline Royal, Feline Focus,
Feline Grain Free, Feline Silver, NutriBalance, Farm Fresh Basics

FIGURE 4.2a Let's focus attention in this graph


QUESTION 1: Let's say you will be presenting this data live and want to begin by 
talking about the Lifestyle brand line: Lifestyle, Diet Lifestyle, and Lifestyle Plus. 
How would you visually indicate to your audience to look at those points of data?

QUESTION 2: Assume you next want to talk about the Feline brand group,
which includes all of the brands with "Feline" in their name. The branding for this 
line of cat food has a purple logo. How would you indicate to your audience that 
they should focus here? 

QUESTION 3: Next, you want to discuss the brands that had year-over-year declines.
How could you draw your audience's attention there?

QUESTION 4: Let's imagine that within the declining brands, you want to talk 
specifically about the two brands that declined the most: Fran's Recipe and 
Wholesome Goodness. How might you achieve this? 

QUESTION 5: Assume you want to talk about the brands that had year-over-year 
increases in sales. How would you draw your audience's attention there? What is 
similar to how you directed attention to the decreasing brands? Would you do 
anything different in comparison? 

QUESTION 6: You want to create a final comprehensive view to be distributed 
that highlights each of the takeaways outlined previously: Lifestyle brands, Feline 
brands, decreasing brands (differentiating those decreasing most), and increasing 
brands (highlighting those that increased most). How would you achieve this? 
How would you pair this with explanatory text and make it clear how the text 
relates to the data?

Solution 4.2: focus on...

We have a couple of elements at our disposal when it comes to drawing attention 
in this exercise: the data itself, and also the data labels that list the various brands. 
Color and bold type will be my primary tools for directing focus in this example. In 
my illustrations, I'm also going to use the title text to briefly describe the takeaway 
I'm highlighting, moving some of the original detail it contained into the subtitle. 

QUESTION 1: When it comes to highlighting the Lifestyle brands, I decided to 
make those data points and the labels that go with them black, plus bold the label 
text. Other colors would work, too, but in the absence of much other context, I 
chose to stay neutral in this initial view. We'll look at more ways to use color and 
related considerations as we get further into this solution. 

So that the specific data points really stand out, I made the other data and labels a 
slightly lighter shade of grey. I also used title text in matching black bold to briefly 
call out the takeaway. 
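
If your tool of choice is code, the following is a minimal sketch of this first view. It assumes Python with matplotlib. The values are made-up placeholders rather than the downloadable data, the grey shade is an assumption, and the names `bar_styles` and `plot_focus` are hypothetical helpers introduced only for illustration.

```python
# Placeholder YoY % changes for a few brands (made up for illustration,
# NOT the downloadable data).
CHANGES = {
    "Lifestyle": -10.0,
    "Coat Protection": -8.0,
    "Diet Lifestyle": -7.0,
    "Lifestyle Plus": -4.0,
    "Feline Gold": 1.0,
    "NutriBalance": 16.0,
}
LIFESTYLE_LINE = {"Lifestyle", "Diet Lifestyle", "Lifestyle Plus"}

FOCUS_COLOR = "black"
CONTEXT_COLOR = "#BFBFBF"  # assumed light grey for everything not in focus


def bar_styles(brands, focus):
    """Return a (color, font weight) pair per brand: emphasis for focus brands, grey otherwise."""
    return [
        (FOCUS_COLOR, "bold") if name in focus else (CONTEXT_COLOR, "normal")
        for name in brands
    ]


def plot_focus(changes, focus, title):
    """Draw a horizontal bar chart with only the brands in `focus` emphasized."""
    # Imported here so bar_styles can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    brands = list(changes)
    styles = bar_styles(brands, focus)
    positions = list(range(len(brands)))

    fig, ax = plt.subplots(figsize=(7, 4))
    ax.barh(positions, [changes[b] for b in brands], color=[c for c, _ in styles])
    ax.set_yticks(positions)
    labels = ax.set_yticklabels(brands)
    ax.invert_yaxis()  # keep the first brand at the top of the graph
    # Match each brand label to its bar: same color, bold only when in focus.
    for label, (color, weight) in zip(labels, styles):
        label.set_color(color)
        label.set_fontweight(weight)
    ax.set_title(title, loc="left", color=FOCUS_COLOR, fontweight="bold")
    ax.set_xlabel("Year-over-year % change in sales volume ($)")
    return fig
```

Calling `plot_focus(CHANGES, LIFESTYLE_LINE, "Cat food brands: Lifestyle line brands declined")` returns a figure that can be saved or shown. Keeping the styling decision in its own small function makes it easy to reuse the same graph with a different set of brands in focus.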


Cat food brands: Lifestyle line brands declined
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

FIGURE 4.2b Focus on Lifestyle brand line


QUESTION 2: Given the purple brand color, I could use this to emphasize the
Feline line of brands, again pairing with bold brand labels and a matching graph
title. We are using the preattentive attribute of hue, or color, to direct attention
and the Gestalt principle of similarity (of color) to tie the spatially separated elements
together. When it comes to other Gestalt principles, we might consider
using position and putting all of the Feline brands at the top of the graph; however,
that would interfere with the thoughtful ordering and make the graph more difficult 
to understand. 


Cat food brands: most in Feline line increased
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

FIGURE 4.2c Focus on Feline brand line

QUESTION 3: To draw attention to the brands that had decreases in year-over-year
sales, I might choose a color that reinforces this negative point. I tend to
avoid red and green for bad and good connotations, respectively, because they are
inaccessible to those who are colorblind (red/green colorblindness is the most
prevalent form, affecting roughly 8% of men and 0.5% of women). I'll often use orange
for negative and blue for positive, as I feel you still get the desired connotation. See Figure
4.2d, which uses orange to highlight the decreasing brands. In addition to the 
graph title, data points, and brand labels, I also made the Decreased title at the 
top orange. I chose not to bold the data labels, because I felt sufficient attention 
was already drawn by making them orange and the bold felt a little excessive. 


Cat food brands: 8 brands decreased in sales
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

FIGURE 4.2d Focus on decreasing brands

QUESTION 4: To draw attention to the two brands that decreased the most, I
could make those orange and everything else grey. If I'm progressing to this point 
from the view in Figure 4.2d, however, I could do this another way: keep all the 
decreasing brands orange, but vary intensity to draw attention to the two that 
decreased the most. See Figure 4.2e. 
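
The color logic from Questions 3 and 4 can be sketched in a few lines of Python: declining brands get orange, everything else goes grey, and intensity separates the biggest declines from the rest. The hex values, the `top_n` parameter, and the function name `decline_colors` are illustrative assumptions, not the palette used in the figures.

```python
ORANGE = "#F28E2B"        # full-intensity orange for the biggest declines
LIGHT_ORANGE = "#F9C99A"  # same hue at lower intensity for the other declines
GREY = "#BFBFBF"          # brands that were flat or increased


def decline_colors(changes, top_n=2):
    """Map each brand to a color, emphasizing the `top_n` largest declines by intensity."""
    # Sort declining brands from most negative to least negative.
    declines = sorted((value, name) for name, value in changes.items() if value < 0)
    strongest = {name for _, name in declines[:top_n]}

    colors = {}
    for name, value in changes.items():
        if name in strongest:
            colors[name] = ORANGE
        elif value < 0:
            colors[name] = LIGHT_ORANGE
        else:
            colors[name] = GREY
    return colors
```

The resulting colors can be passed, in brand order, to a bar chart call such as matplotlib's `Axes.barh(..., color=[...])`. Setting `top_n` to the number of declining brands reproduces the simpler view where every decrease is the same orange.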


Cat food brands: 2 brands decreased the most
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

FIGURE 4.2e Focus on the brands that most decreased

QUESTION 5: To draw attention to brands having increasing sales, I could use
blue, for the reasons described in my response to Question 3. See Figure 4.2f. 


Cat food brands: 11 brands flat to increasing
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

FIGURE 4.2f Focus on increasing brands


QUESTION 6: Finally, if I want to pull a number of these observations together,
I might do so in two comprehensive slides. This would allow those processing the 
information on their own to get a similar progression to what I'd show in a live 
setting. Note that my text is mostly descriptive (or fabricated!)—ideally we'd have 
additional context on what drove the changes we're seeing, and perhaps related 
information to share, or a specific point to make or discussion to drive. 

To me, this felt like too much to pack into a single slide, so I broke it into two views 
of the data in order to highlight everything that we did step by step previously. 
See Figures 4.2g and 4.2h.

Cat food brands: mixed results in sales year-over-year
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

Brands in the Lifestyle line all
decreased year-over-year, mainly
due to a marketing shift away from
these products. Classic Lifestyle had
the biggest decrease in sales, down
10% year-over-year, while Lifestyle
Plus had the smallest decrease (4%).

Most brands in the Feline line
increased in sales year-over-year,
largely due to the partnership with
PetFriends retailers that we entered
into mid-year. We anticipate continued
momentum in the coming year.

FIGURE 4.2g Comprehensive slide #1: Lifestyle and Feline brands


Cat food brands: mixed results in sales year-over-year
YEAR-OVER-YEAR % CHANGE IN SALES VOLUME ($)

Eight key cat food brands declined in
sales year-over-year, with five brands
decreasing 7%+. This was expected
in some cases due to focus shift
toward higher margin brands. Fran's
Recipe and Wholesome Goodness
each declined by more than 13%,
which was more than expected.

On the positive side, five brands
increased 8%+ year-over-year, with
marked 16%+ increases for
NutriBalance and Farm Fresh
Basics.

What can we learn from increasing
brands that we can apply
elsewhere? Let's discuss next steps.

FIGURE 4.2h Comprehensive slide #2: decreasing and increasing brands

Exercise 4.3: direct attention many ways

As we saw in Exercise 4.2, color used sparingly can work well to direct our audience's
attention to where we want them to look. But color is not the only visual
element we can use to do this. More broadly, preattentive attributes are hugely 
important tools in our toolkit when it comes to creating effective visual designs. 
In addition to color (hue), these are things like size, position, and intensity that— 
when used thoughtfully and sparingly—help us to create contrast and direct our 
audience's attention. In other words, there are a lot of different attributes of our 
visual designs that we can play with to achieve this, and various circumstances or 
constraints may cause us to employ different strategies. Let's look at a specific example
and explore the numerous ways we could indicate to our audience where
we want them to look. 

Check out the following graph, which shows conversion rate over time by acquisition
channel. Assume you'd like to draw your audience's attention to the Referral
line. How could you use preattentive attributes to do so? How many different 
ways can you come up with to focus your audience's attention? List them! To 
take it a step further, apply the strategies you've listed using your tool of choice. 


Conversion rate over time

FIGURE 4.3a How could we focus attention on the Referral line in this graph?

Solution 4.3: direct attention many ways

I'm going to illustrate 15 ways to indicate to my audience that I'd like them to
focus on the Referral trend. Is your list shorter than this? If so, go back and see if 
you can generate a couple more ideas. 

Ready? Let's check out the various ways we could direct attention. We'll start with 
a few brute force options, then get more nuanced from there. 

1. Arrow. We could use an arrow to literally point out to our audience what we'd like
them to look at: the Referral line.

Conversion rate over time

FIGURE 4.3b The "look here" arrow

2. Circle. We could circle the Referral line. Yes, this is another blunt tool. When it 
comes to the arrow and the circle—I love and hate these approaches with equal 
measure. I love the fact that it means someone looked at the data and said "I 
would like you to look here" and then did something to make that happen. The 
challenge is that the arrow and the circle are elements that we've added and,
in and of themselves, carry no informative value. So from that standpoint, they 
add clutter. Still, they are better than nothing: I'd rather have a blunt tool that 
indicates to my audience where they should pay attention than nothing at all. But 
even better if I can find some aspect of the data to change to create this emphasis.

Conversion rate over time
X-axis: FISCAL YEAR, 2005 to 2019

FIGURE 4.3c Circle the data

3. Transparent white boxes. Let's look at one more brute force method before 
we get more elegant: transparent white boxes. This method can be useful if you 
ever have to take a screenshot from a tool and don't have the ability to change 
the design of the data. Use transparent white boxes to cover up everything you 
want to push to the background. This has the effect of reducing the intensity of 
everything that is covered, while leaving what you want to draw attention to in full 
intensity. See Figure 4.3d. 
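
To see why a partially transparent white box reduces intensity, it can help to work through the arithmetic. The sketch below assumes the tool blends the box with whatever is underneath using standard alpha compositing, so each color channel is pulled toward white in proportion to the box's opacity. The function name `cover_with_white` is hypothetical.

```python
def cover_with_white(rgb, opacity):
    """Return the color seen when a white box of the given opacity covers `rgb`.

    `rgb` holds 0-255 channel values; `opacity` runs from 0 (invisible box)
    to 1 (solid white box). Assumes standard alpha compositing.
    """
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must be between 0 and 1")
    # Each channel is a weighted mix of white (255) and the covered color.
    return tuple(round(opacity * 255 + (1 - opacity) * channel) for channel in rgb)
```

Under that assumption, a black line under a box at 80% opacity (20% transparency) ends up a light grey, while a fully transparent box leaves the original color untouched. The more opaque the box, the further everything beneath it recedes.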


Conversion rate over time

FIGURE 4.3d Cover everything else up with transparent white boxes

Depending on the shape of your data, you may have to use multiple boxes or
other shapes to fully cover it. If you look closely at Figure 4.3d, I didn't do a perfect
job: near where the lines overlap in the middle of the graph, the Organic line
didn't fully get covered. Figure 4.3e outlines my various transparent white boxes
(some of them rotated to better fit the data) in black so you can see the monkeying
that has to happen at times to make this work well.


Conversion rate over time

FIGURE 4.3e Highlighting the transparent white boxes

This is another brute force method, but depending on your constraints it can 
sometimes be useful. Next, let's look at some more elegant approaches for directing
attention.

4. Thicken the line. We could make the Referral line thicker or the others thinner 
or a combination of these. We can also manipulate the word "Referral." In this 
case, I've also thickened it by making the text bold. See Figure 4.3f.

Conversion rate over time

FIGURE 4.3f Thicken the line


5. Change the line style. Varying line style is another way to signal that something 
is different and direct attention. Dashed or dotted lines are super attention grabbing 
when they appear together with solid lines. The challenge is that from a cognitive 
burden standpoint, we've taken what could have been a single element (a line) and 
chopped it into many pieces. This adds some visual noise. Because of this, I recommend
reserving the use of dotted lines for when there is uncertainty to depict: a forecast,
a prediction, or a target or goal of some sort. In these cases, the visual sense of
uncertainty that you get with the dotted line makes up for the additional visual clutter 
it introduces. 


Conversion rate over time

FIGURE 4.3g Change the line style

6. Leverage intensity. We can make the line we want to emphasize darker in color.
See Figure 4.3h.

Conversion rate over time

FIGURE 4.3h Make it darker

7. Position in front of other data. Position is another preattentive attribute. We 
can't change the order of the data when it comes to a line graph—it is where it is 
because of the data it plots. But we can take steps to ensure it doesn't fall behind 
other data. Take note of Figure 4.3h in the middle of the graph, where the grey 
Organic line crosses in front of Referral. We can pull the latter forward to correct 
this (this is typically dictated by the data series order, which you can modify in 
most tools). See Figure 4.3i. 


Conversion rate over time

FIGURE 4.3i Position in front of other data

8. Change the hue. We can change the hue, or color, of the line we want our audience
to focus on, leaving everything else grey. See Figure 4.3j.


Conversion rate over time

FIGURE 4.3j Change the hue

9. Use words to prime audience. In Figure 4.3k, I've added a takeaway to the title 
about the Referral data. Once my audience reads this, they know to look for the 
Referral line in the graph. We'll look at more examples of takeaway titles when we 
talk about words in the context of story in Chapter 6. 



Conversion rate over time: Referral decreasing markedly since 2010

FIGURE 4.3k Use title words to prime audience

10. Eliminate the other data. One way to get the audience to focus on the data we
want them to look at would be to eliminate all the other data, making it the only line
at which they can look. You should always ask yourself whether you need all of the
data you are showing. But any time you debate removing data, also consider what
context you lose when you do so and whether this tradeoff makes sense given what
you need to communicate.


Conversion rate over time

FIGURE 4.3l Eliminate the other data


11. Animate to appear. Though difficult to show in a static book, motion is the 
most attention-grabbing preattentive attribute and can work very well in a live 
setting (where you are presenting the graph and can flip through various views). 
Imagine we start with an empty graph that only has the x- and y-axes. Then we 
could add a line representing the Total conversion rate and discuss. Next, I could 
layer on the Organic conversion rate and talk about that. Finally, I could add the 
Referral line. The simple fact of it not being there and then appearing would garner
attention.

The challenge with motion is that it's also easily annoying. The only animations I
recommend are appear, disappear, and transparency. No flying, bouncing, or fading:
these add glitz without value and are another form of visual clutter.

12. Add data markers. Reverting to showing all the data, we can add data
markers to draw attention. See Figure 4.3m. 


Conversion rate over time

FIGURE 4.3m Add data markers


13. Add data labels. Taking it a step further, we could also add data labels to the 
various points on the line we want to emphasize. This is a way of saying to our 
audience, "Hey, this part of the data is so important, I even added some numbers 
there to help you interpret it." See Figure 4.3n. Text annotations that explain additional
context or point out nuances in the data you want your audience to focus
on can achieve a similar effect.

FIGURE 4.3n Add data labels

When we add data markers and data labels to every single data point, we can
sometimes end up with a cluttered mess. That said, when we are sparing about 
which points we choose to put markers and labels on, we can direct our audience to 
make certain comparisons within the data. Let's check out an example of this next. 

14. Employ end markers and labels. By putting end markers and labels on each 
line, as I've done in Figure 4.3o, one easy and obvious comparison for my audience
to make is how the different conversion rates compare to each other as
of the most recent point of data. This doesn't draw attention to the Referral line 
specifically, but we'll do that again in our next step. 


Conversion rate over time

FIGURE 4.3o Employ end markers and labels


15. Combine multiple preattentive attributes. We can use multiple preattentive
attributes to really make it clear where we want our audience to look. In Figure
4.3p, I've used words in the title to prime my audience (making them the same
color as the data they describe, leveraging the Gestalt principle of similarity), and
I've made the line I want my audience to pay attention to thicker and colored, with
data markers and data labels. Annotations can also help explain additional context
for the data of interest.

Conversion rate over time: Referral decreasing markedly since 2010

FIGURE 4.3p Combine multiple preattentive attributes

Practice the "Where are your eyes drawn?" test with Figure 4.3p. Where do your 
eyes go first? Where do they go next? What about after that? 

When I close my eyes and then open them and look at Figure 4.3p, my eyes go 
first to the title text in red. Then they jump down to the red line in the graph. 

I move to the right and can easily compare the most recent data point (2019) 
between Referral conversion rate and Organic and Total. I can move my eyes 
leftwards to read additional detail via the annotations of what is driving some of 
what is being shown. In this way, I've used preattentive attributes to both direct 
attention and create visual hierarchy, making my overall visual easier for my audience
to consume. Success!
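
Several of the approaches in this solution (thicker line, hue, drawing order, end markers and labels, a takeaway title) translate directly into line properties in a plotting library. The following Python sketch assumes matplotlib. The colors and line widths are illustrative guesses, the whole title is colored for simplicity (rather than only the words about Referral), and `line_styles` and `plot_conversion` are hypothetical helper names.

```python
def line_styles(series_names, focus, focus_color="#C0392B"):
    """Return matplotlib line keyword arguments for each series.

    The focus series is thicker, colored, and drawn in front (higher zorder);
    every other series is a thinner grey line behind it.
    """
    styles = {}
    for name in series_names:
        if name == focus:
            styles[name] = dict(color=focus_color, linewidth=3, zorder=3)
        else:
            styles[name] = dict(color="#A6A6A6", linewidth=1.5, zorder=2)
    return styles


def plot_conversion(years, rates, focus, title):
    """Plot each series in `rates` (name -> list of values) with `focus` emphasized."""
    # Imported here so line_styles can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    styles = line_styles(rates, focus)
    fig, ax = plt.subplots(figsize=(8, 4))
    for name, values in rates.items():
        style = styles[name]
        ax.plot(years, values, **style)
        # End marker and label on the most recent point of each line.
        ax.plot(years[-1], values[-1], "o", color=style["color"], zorder=style["zorder"])
        ax.annotate(
            f"{name} {values[-1]:.1f}%",
            (years[-1], values[-1]),
            xytext=(8, 0),
            textcoords="offset points",
            va="center",
            color=style["color"],
            fontweight="bold" if name == focus else "normal",
        )
    # Title in the focus color ties the words to the data (similarity).
    ax.set_title(title, loc="left", color=styles[focus]["color"])
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    return fig
```

A call would pass the fiscal years, a dictionary with one list of rates per channel (for example Total, Organic, and Referral), `focus="Referral"`, and the takeaway title. Raising `zorder` for the focus series is what keeps it in front of the other lines where they cross.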

Exercise 4.4: visualize all the data


Let's revisit an example we looked at in Chapter 2. You may recall the scenario 
where you work at Financial Savings and want to compare your bank's performance
against your peers'. You have data on bank index (branch satisfaction) over
time for your bank plus a number of your competitors. The original graph is shown 
in Figure 4.4a.

FIGURE 4.4a Bank index


We previously looked at an approach where we changed from this dot plot to a 
line graph and summarized all of the competitor data with a single average line 
(see Solution 2.7). 

But what if we want to show all the data? How could we achieve this without it 
being overwhelming? Download the data and create your preferred view.

Solution 4.4: visualize all the data

We can get away with showing quite a lot of data if we push most of it to the 
background. 

I've had people tell me at workshops before that they didn't realize the power of 
grey. This muted color works well for things that need to be present (axis labels, 
axis titles, non-message-impacting data) but don't need to draw a ton of attention.
Check out how the strategic use of grey in Figure 4.4b helps us in this case.


BRANCH SATISFACTION
Financial Savings below industry for first time in 5 years

FIGURE 4.4b Can show all the data if we push most of it to the background


In addition to making the competitor banks grey, I also made the lines for that 
data thinner than both the Industry average and Financial Savings. This is another 
means for de-emphasizing them, so they are there for reference but not drawing 
attention. If there's an individual competitor bank we want to identify, that becomes
difficult (we could cycle through various competitors through sparing emphasis
in a live presentation, or label possibly one or two in a static view, though
you can imagine how this will quickly get messy). If Financial Savings versus specific
competitors is important, then this isn't the best way to look at this data. In
that case, I could focus on just the latest data point across the various banks and 
plot as a single horizontal bar chart.

But sticking with this view, let's take things a step further. Say that within all of this
and in addition to directing attention to the Industry average and Financial Savings,
we want to make a point about the recent year. I can use an additional color
to achieve this: see Figure 4.4c.

FIGURE 4.4c Could focus on latest year-to-year period of time

Sparing emphasis allows us to direct attention even when showing a lot of data. 
Consider how you can employ this tactic in your own work. 
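
One way to sketch the "additional color for the recent year" idea in code is to split the emphasized series so the latest period can be styled on its own, while competitor lines stay thin and grey. The example assumes Python with matplotlib. The colors are assumptions rather than the ones in the figure, only one emphasized series is handled for brevity (the figure also emphasizes the Industry average), and `split_last_segment` and `plot_with_recent_focus` are hypothetical names.

```python
def split_last_segment(xs, ys):
    """Split a series into (history, latest) so the latest period can be styled separately.

    The two parts share one point so the drawn line stays continuous.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two aligned points")
    history = (list(xs[:-1]), list(ys[:-1]))
    latest = (list(xs[-2:]), list(ys[-2:]))
    return history, latest


def plot_with_recent_focus(years, series, primary, accent="#F28E2B"):
    """Grey out every series except `primary`, and accent its most recent period."""
    # Imported here so split_last_segment can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4))
    for name, values in series.items():
        if name != primary:
            # Background context: thin, light grey, drawn behind everything else.
            ax.plot(years, values, color="#C8C8C8", linewidth=1, zorder=1)

    (hist_x, hist_y), (last_x, last_y) = split_last_segment(years, series[primary])
    ax.plot(hist_x, hist_y, color="#1F4E79", linewidth=3, zorder=2)
    # The latest year-to-year change gets the additional accent color.
    ax.plot(last_x, last_y, color=accent, linewidth=3, zorder=3, marker="o")
    return fig
```

Because the two segments share the second-to-last point, the emphasized line reads as one continuous series that simply changes color for the most recent period.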

Now that you've seen me solve some exercises related to focusing attention, let's 
shift to exercises for you to tackle on your own. 


PRACTICE w'Jl COLE 


Practice y^ r own 


178 


focus attention 


PRACTICE on your OWN

Let's look at more pictures to understand the subtleties of what garners attention
and how we can utilize those dimensions when communicating. There's not a single
way to do this; there are many, which we'll continue to explore in the following
exercises.


Exercise 4.5: where are your eyes drawn? 

As we've seen, observing where our eyes land first in a graph or on a slide can 
help us determine whether we're using our preattentive attributes strategically to 
direct attention to the most important part and create clear visual hierarchy. Let's 
do some additional practice with this simple test. 

Consider the following visuals. For each, close your eyes or look away, then look 
back at it and take note of where your eyes go first. Why is this? What can you 
learn from this activity that you can generalize to how to effectively communicate 
with data? Write a short paragraph answering these questions for each image. 



FIGURE 4.5a Where are your eyes drawn? 

FIGURE 4.5b Where are your eyes drawn?

FIGURE 4.5c Where are your eyes drawn?



FIGURE 4.5d Where are your eyes drawn? 

FIGURE 4.5e Where are your eyes drawn?



FIGURE 4.5f Where are your eyes drawn? 

Exercise 4.6: focus within tabular data

While the examples that we've looked at in this chapter so far have all been images
and graphs, we can use preattentive attributes to direct attention in tables, too.

Check out the following table, which shows data about the latest four weeks of 
sales for the top ten sales accounts for a popular brand of coffee. Answer the 
following questions. 


WakeUp Coffee

Top 10 accounts: 4-week sales ending January 31st

Account  Sales Volume  % Change vs prior  Avg # of UPCs  % ACV Selling  Price per Pound
A             $15,753              3.60%           1.15             98           $10.43
B            $294,164              3.20%           1.75             83           $15.76
C             $21,856             -1.20%           1.00             84           $12.74
D            $547,265              5.60%           1.10             89            $9.45
E             $18,496             -4.70%           1.00             92           $14.85
F             $43,986             -2.40%           2.73             92           $12.86
G             $86,734             10.60%           1.00            100           $17.32
H             $11,645             37.90%           1.00             85           $11.43
I             $11,985             -0.70%           1.00             22           $20.82
J            $190,473             -8.70%           1.00             72           $11.24


UPC is the Universal Product Code, the barcode symbology. 

ACV is All-Commodity Volume, measured as a percentage from 0 to 100. 

FIGURE 4.6 Practice focusing attention within this table 

QUESTION 1: Let's assume Sales Volume is the most important data in this table 
and that the rest of the data is there for additional context or because we know 
people in our audience will want to see it. We've already positioned it as the first 
column of data, but what else can we do to direct attention or make this data 
easier to process? 


QUESTION 2: Account D is much bigger in terms of sales volume than any of the 
other accounts, yet it takes some time staring at this table to figure that out. How 
could we draw our audience's attention to Account D more quickly? List three 
specific strategies you could use to set this row apart from the rest. Which do you 
like best and why? 


QUESTION 3: Let's continue with our focus on Account D. What if, within Account
D, we wanted to highlight the low Price per Pound? How could you achieve this? 


QUESTION 4: Let's reset and say you want to focus attention on the relative 
Price per Pound within this table. Does this change where you would position this 
column or how you might order the rows of data? What are three different ways
you could indicate to your audience you want them to focus here? 

QUESTION 5: Make the changes you've outlined. Download the data and tackle
this in the tool of your choice.
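
If you tackle this in code, a reasonable first step is to get the rows into a form you can reorder. The following Python sketch is one possible starting point, not a prescribed solution: it uses only the Account, Sales Volume, and Price per Pound columns from Figure 4.6, and the "<<" marker is simply a plain-text stand-in for whatever emphasis (bold, color, shading) your tool supports.

```python
# (account, sales volume in dollars, price per pound in dollars) from Figure 4.6.
rows = [
    ("A", 15753, 10.43), ("B", 294164, 15.76), ("C", 21856, 12.74),
    ("D", 547265, 9.45), ("E", 18496, 14.85), ("F", 43986, 12.86),
    ("G", 86734, 17.32), ("H", 11645, 11.43), ("I", 11985, 20.82),
    ("J", 190473, 11.24),
]

# Order by the column we care about so position does some of the work.
by_sales = sorted(rows, key=lambda row: row[1], reverse=True)
top_account = by_sales[0][0]

print(f"{'Account':<8}{'Sales Volume':>14}{'Price per Pound':>17}")
for account, sales, price in by_sales:
    # "<<" is a plain-text stand-in for visual emphasis on a single row.
    marker = "  <<" if account == top_account else ""
    print(f"{account:<8}{sales:>14,}{price:>17.2f}{marker}")

# Reordering by price instead supports the focus described in Question 4.
by_price = sorted(rows, key=lambda row: row[2])
lowest_price_account = by_price[0][0]
```

Sorting by the measure you want noticed puts the most important row at the top, where the audience starts reading, before any color or bolding is applied.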


Exercise 4.7: direct attention many ways 

As we've discussed, we have many options for indicating to our audience where 
they should pay attention in the data we show. 

Take the following example, which shows market share over time for a given product.
Let's assume we want to direct our audience's attention to Our Product. In
what ways could you use preattentive attributes to do so? How many different 
methods can you come up with to focus your audience's attention? List them! 

FIGURE 4.7 How could we direct attention to Our Product in this graph?

To take it a step further, download the data and apply the strategies you've listed 
using your tool of choice. 

Exercise 4.8: how can we focus attention here?

You've practiced focusing attention with tables and lines; next let's take a stab at 
doing the same with bars. 

Let's revisit an example from Chapter 2. Imagine you work for a regional health 
care center and want to assess the relative success of a recent flu vaccination education
and administration program across your medical centers. Figure 4.8 is a
slightly modified version of the original graph. 

Let's assume we want to direct our audience's attention to those medical centers 
that are above average. In what ways could you use preattentive attributes to do
so? How many different methods can you come up with to focus your audience's
attention? List them!


[Bar chart: Successful Opportunities by Center (FLU). Horizontal axis: MEDICAL CENTER]

FIGURE 4.8 How can we focus attention on above average medical centers? 

To take it a step further, download the data and practice applying the various ways 
you've listed for directing attention in your tool. 
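
For those working in code, one of the strategies you might list (color the above-average bars and mute the rest) can be sketched as follows. This is a hedged Python illustration only: the center names and values are hypothetical because the figure's data is not reproduced here, and the two hex colors are arbitrary choices.

```python
from statistics import mean

# Hypothetical success rates (%) by medical center; not the book's data.
rate_by_center = {"A": 62, "B": 71, "C": 55, "D": 80, "E": 66}

average = mean(rate_by_center.values())

HIGHLIGHT = "#1F77B4"  # assumed emphasis color
MUTED = "#BFBFBF"      # assumed background grey


def bar_color(value, threshold):
    """Emphasize bars strictly above the threshold; mute the rest."""
    return HIGHLIGHT if value > threshold else MUTED


colors = {center: bar_color(rate, average) for center, rate in rate_by_center.items()}
above_average = sorted(c for c, rate in rate_by_center.items() if rate > average)

print(f"average = {average:.1f}")
print("above average:", ", ".join(above_average))
```

Because the rule is computed from the data, the highlighted set updates itself when the numbers change, which is harder to guarantee when bars are recolored by hand.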

The following exercises will help you direct
attention in your visual communications. 


Exercise 4.9: where are your eyes drawn? 

Your eyes and attention are a good initial proxy for your audience. After you create
a graph or a slide, close your eyes or look away. Look back at it, taking note of
where your eyes land first. Is this consistent with the place you want your audience 
to focus their attention first and foremost? If not, what changes can you make to 
achieve this? Consider how you are using preattentive attributes sparingly both to 
direct attention and create visual hierarchy. 

Recognize, however, that since you are the one who designed the data, you already
know things about it and are predisposed to focus on certain aspects in
ways that your audience may not. Given this, after you've practiced the "Where 
are your eyes drawn?" test and iterated as needed to be happy with the result, 
solicit the assistance of someone else. Grab a friend or colleague and show them 
your graph or slide and ask them where their eyes go first. Is it to the place you 
want? Use this information to continue to iterate as needed. 

Beyond where their eyes go first, have them talk you through how they process 
the information. What do they pay attention to first? After that? And then? What 
questions do they have? What observations do they make? Understanding this 
from someone who isn't as close to the data and information will give you important
insight into what is working and whether and how you can make changes so
that your audience knows where to look and how to process the information you 
put in front of them. 

Exercise 4.10: practice differentiating in your tool

There are many different tools that can be used to visualize data. Each comes 
with its own set of abilities and constraints. To be effective in the way we visualize 
and communicate with data, we need to get to know our tools well enough to 
apply the various strategies that are covered here and in SWD. In some cases, this 
might mean writing code. The beauty of code is that once you've written it, there 
are likely lines or chunks of it that you can repurpose later (win!). Or it may mean 
finding the right combination of drop-down menus and selections in your tool 
(that you have to either template or apply each time: that's okay, it gets quicker 
with experience). 

In any case, let's practice getting to know our tool, and what we can do with it, a
bit better.

Take a graph you have created. This can be anything. If you don't have a work 
example handy to use for this, you can select data from any of the exercises in this 
book to download and create a graph with which to play. Create a line graph or 
bar chart. Figure out how to achieve the following in your tool of choice. 

Bold/thick: Pick a text element within your graph and make it bold. Make a single 
line or bar thicker than those around it. 

Color: Start by making everything grey. Pick a single line or series of bars and 
make them blue. Pick another and make it match your organization's primary 
brand color. Figure out how to take an individual data point—a point in a line 
graph or a single bar in a series—and change the color of just that point. 

Position: Let's practice moving things around. If you are working with a bar chart, 
reorder the bars: make them ascending and then descending. If you are working 
with a line graph and have lines that cross each other, pick one and figure out how 
to move it in front of or behind the others. 

Dotted or dashed line: Are there any lines in what you're showing that you can 
make dotted or dashed? I bet there are. If you are working with a line graph, figure
out how to change the line style of one of the lines. If faced with a bar chart,
determine how you could do this for the outline of one (or more) of the bars. 

Intensity: Vary intensity by rendering some data in full intensity and the rest in 
a lesser intensity. You can do this by applying transparency, a pattern, or simply 
picking a less intense color. Consider both how you can do this by modifying the 
formatting of the data directly, as well as whether or how you could use transparent
boxes or other shapes to achieve this effect in a brute force manner.

Label data points: Start by adding labels to an entire data series. Next, figure out
how to move them around. On a line graph, position the labels above the data 
series, then below. In a bar chart, label them on top of bars, then pull the labels 
inside the ends of the bars. Next, determine how you'd approach it if you only 
wanted to label a single data point (or a couple of data points). If you're using a 
graphing application (not writing code) there are brute force solutions for adding
one at a time or deleting individual labels. You might add another series of
data (and make it invisible but use the positioning for labels, as one example) to 
streamline your process. 

What else do you want to learn how to do in your tool? Make a list and determine 
what resources (colleagues, smart online searches, perhaps classes or tutorials) 
can help you achieve your goals. Learning any tool takes time. But it is nearly 
always time well spent. There is no better satisfaction than when you can use your 
tool to fully meet your needs! 
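
As one example of the reusable code mentioned above, the Intensity item can be approached by blending a color toward white rather than relying on transparency. The following Python helper is a minimal sketch under that assumption; the function name lighten and the sample blue are my own illustrative choices, not something defined in SWD or in any particular tool.

```python
def lighten(hex_color, amount):
    """Blend a '#RRGGBB' color toward white.

    amount=0 returns the original color; amount=1 returns white.
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    digits = hex_color.lstrip("#")
    channels = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
    # Move each channel the same fraction of the way toward 255.
    blended = [round(c + (255 - c) * amount) for c in channels]
    return "#{:02X}{:02X}{:02X}".format(*blended)


full_intensity = "#1F77B4"              # an arbitrary blue
lesser_intensity = lighten(full_intensity, 0.6)
print(full_intensity, lesser_intensity)
```

A solid lighter color and a semi-transparent one can look alike on a white background, but the solid version behaves predictably where lines or bars overlap.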

Exercise 4.11: figure out where to focus

SWD and this book generally make a big assumption: you've thoroughly analyzed 
your data and already have something specific that you want to communicate 
to your audience. I tend to draw a distinction between exploratory analysis and 
explanatory analysis, and assume that the former has been done and focus on 
teaching the latter. This sometimes leads to the question: how do you figure out 
where to focus in the first place? 


This is a harder piece to teach and is as much art and science as what we focus 
on here with explanatory communication. While I characterize exploratory and 
explanatory as distinct phases, in reality there isn't a solid line between the two. 
Often, we cycle back and forth through each over the course of a project. When 
it comes to the "Where do I focus?" query, there are some questions you can ask 
yourself to help navigate. Consider the following (incomplete) list. 

• When is it appropriate to aggregate the data? 

• When and how should you disaggregate the data? 

• What is the right time frame to consider? How far back should you go? 

• How does it make sense to break the data down? Look at things by line of 
business, region, product, tenure, or other categories. Where are things similar?
Where are they different? Why is that?

• Do things align with what you expect? In what instances are they different? 

• How do different things relate to each other? Do some things drive others? 

• What comparisons are meaningful or will lead to potential insight? 

• What context may be useful that you don't have? Who can you ask about this? 

• What questions could someone else looking at this data have? 

• What assumptions are you making? How big of a deal is it if those assumptions
are wrong?

• What is missing? Data doesn't typically tell the whole story. How can you address
or understand the missing pieces?

• Is history likely to be the same as or different from the future?

Exercise 4.12: let's discuss

Consider the following questions related to Chapter 4 lessons and exercises. Discuss
with a partner or group.

1. What design elements do we have at our disposal for directing attention 
when visualizing and communicating with data? Which do you find most 
effective and why? 

2. What is the "Where are your eyes drawn?" test? When and why would you use it? 

3. There are numerous ways to direct attention in text, tables, points, lines, and 
bars. What are common ways to indicate to your audience where you want 
them to focus? How are the means by which you can achieve this across various
graph types different?

4. What things are important to keep in mind when choosing the color(s) you 
use in your graphs? Are there any color combinations you will embrace or 
avoid going forward? Why is that? 

5. How is sparing emphasis for explanatory communications different from how 
you would design a dashboard where the data is meant to be explored? How 
might you approach the use of color in a dashboard compared to when there 
is a specific takeaway you want to highlight? 

6. What is visual hierarchy? Why is it useful to create visual hierarchy in your 
data visualizations and the pages that contain them? 

7. Why does emphasis need to be sparing to be effective? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback? 


chapter five

think like a designer

You know what great design looks like when you see it, but how do you actually 
achieve it—particularly if you don't consider yourself a designer? SWD covered 
four topics to help you think like a designer: affordances, aesthetics, accessibility, 
and acceptance. In this chapter, we'll practice applying these concepts and illustrate
how minor changes can help take your visual from acceptable to exceptional.
First, let's cover a quick reminder of what I mean by these terms.

In visual design, affordances are things we do to make it clear how to process 
what we show. This builds off of the lessons you've practiced in Chapters 3 and 
4: tie related things visually together, push less important elements to the background,
and bring the critical stuff forward. Direct your audience's attention intentionally
to where you want them to look.

Spending time on the aesthetics of your visuals can translate into people taking 
more time with your work or having the patience to overlook issues. Attention to 
detail comes into play: often many seemingly minor components add up to create 
a great or poor experience. To achieve the former, we must edit ruthlessly. 

People are each different, and accessibility means recognizing this and working 
to create designs that are usable by people of diverse skills and abilities. We've 
touched on colorblindness, but that only scratches the surface. We'll undertake 
exercises that will help you think about your designs more robustly. There is one 
simple thing that can help us improve the accessibility of our graphs broadly: using 
words wisely. 

Finally, our visual designs only work if our audience accepts them and there are 
things we can do to make this more likely, which we'll explore. 

Let's practice thinking like a designer! 

First, we'll review the main lessons from SWD Chapter 5. 

Exercise 5.1: use words wisely

When we communicate with data, people sometimes have the false belief that 
words have no place or should be kept to a minimum. But words play a critical role 
in making the numbers and graphs that we use to communicate data understandable
to our audience. The text we put on our graphs helps people comprehend
what they are seeing and can assist in shaping their perceptions about the data. 

Let's do a quick exercise to illustrate the importance of words on graphs. 

Study Figure 5.1a, which shows sales over time for four brands of laundry detergent.
There are already words on this graph: but are there enough? Could we
use words more wisely? Consider these questions as you look at the data, then 
complete the following steps. 

FIGURE 5.1a Could we use words more wisely?




STEP 1: What questions do you have about the data shown in Figure 5.1a? List 
them! What assumptions would you have to make to interpret this data? 

STEP 2: What words could you add to this graph to answer the questions you 
raised in Step 1? Freely make additions and changes to titles and labels so that what
is being shown is perfectly clear. 

STEP 3: How could putting different words on this graph change the interpretation
of the data? How can you change axis titles and other text to cause an
alternate understanding of what this visual shows? What implications does this 
have for what words should be present on every graph? Write a paragraph or two 
summarizing your learnings from this exercise. 

STEP 4: For hands-on practice, write on Figure 5.1a or download the data or 
graph. Either add text to the existing graph or create a new one in the tool of your 
choice, practicing using words wisely to make the information accessible. 

Solution 5.1: use words wisely

When you create a graph, the details are almost always clear to you. The challenge
is that they aren't necessarily obvious to your audience, who may have different
expectations or understanding of the context. In the absence of text to make
the data comprehensible, your audience is left to make assumptions, just as you
had to do in this exercise. Not only does this make them use more brainpower than
necessary, but worse, those assumptions might be wrong!

Let me take you through my approach to this exercise to illustrate how choice of 
words can lead us to completely different interpretations of the data. 

STEP 1: I have four main questions about this data. 

• What is graphed on the y-axis? We know from the titles that it represents 
sales, but that's not nearly descriptive enough. Are these actual number of 
units sold? Or hundreds of units sold? Or perhaps this represents monetary 
sales: for example, thousands of dollars, or millions of pounds. 

• What is graphed on the x-axis? The month labels clearly indicate time, but 
this doesn't tell us enough. What time period is this? Are we looking back at 
historical data, projecting into the future, or possibly some combination of 
the two? 

• What broader context do the four brands fit into? Do they represent all four 
brands carried on a particular website or at a specific store? Are they the four 
main brands of a given manufacturer? Or are they the top or bottom four 
brands of some greater population? 

• What realm does this data represent? Without any frame of reference, I could 
assume this is a robust representation (e.g. worldwide sales or US Sales). But 
it could be for some subsegment: a certain city, state, or region; a specific 
product line; a particular manufacturer; or a given chain of stores. 

Consider how different perspectives answering the questions raised above could 
lead us to totally different interpretations of this data. Let's look at that more specifically
next.

STEP 2: Figure 5.1b shows one way I could add words to this graph to answer the
questions I raised in Step 1.

FIGURE 5.1b Clear title text aids understanding

In Figure 5.1b, I assumed these represent unit sales for the four brands of laundry 
detergent sold at a specific store. I made this clear through titling: substituting a 
more descriptive graph title and adding axis titles to both the y- and x-axes. 

Let's review some specific design choices made with the text I added to this graph. 

I left-aligned the graph title. We've discussed the typical zigzagging "z" of information
processing a couple of times already (in the solutions to exercises 2.1 and
3.4, as well as in SWD). As a reminder, without other visual cues, your audience
will start at the top left of your graph and do zigzagging "z's" to take in the information.
By orienting our graph title at the top left, our audience hits what they are
looking at before they see the actual data. This is the same reason for orienting 
my axis titles at the top (y-axis) and left (x-axis). 

I paid close attention to detail in the alignment of my axis titles, aligning the y-axis
title with the top of the highest y-axis label and the x-axis title
at the left with the left-most axis label. I chose all caps for my axis titles (and will
often do this for axis titles in general). Because capitalized letters are all the same 
height, this creates a neat rectangular shape (compared to what you'd get with 
mixed case: a jagged edge). I like the framing this lends to my graph. I also wrote 
the axis titles in grey text, so they are there to make it clear what we are looking 
at, but aren't drawing undue attention or distracting from the data. 

STEP 3: Alternate words could lead to a totally different interpretation about 
what this data is and represents. See Figure 5.1c. 

FIGURE 5.1c Different text could lead to completely different interpretation


This has implications for the words that should be present on every graph. I can 
generalize into a couple of guidelines. Every graph should have a title. When 
communicating with a slide deck, I use descriptive titles for my graphs and takeaway
titles for my slides (we'll talk more about the latter in Chapter 6). That's certainly
not your only option, and we've looked at examples in this book where the
graph title is both descriptive and highlights a takeaway. Be consistent in how you
title within a given report or presentation.


Every axis should also have a title. Exceptions to this guideline are rare. Title explicitly
so your audience doesn't have to spend their brainpower trying to figure
out or make assumptions about what they are viewing. 


Words make our visuals comprehensible for our audience. Use them! 

Exercise 5.2: do it better!

The graphing applications we use to visualize data are built to meet the needs 
of many different scenarios. This means that it's rare that the default settings will 
meet the needs of any one of those scenarios exactly. That's where we come
in: our understanding of the context and design sense can improve defaults tremendously,
helping make information more easily digestible and simply more
pleasant to look at and spend time with.

Let's dissect a specific example, considering how we can use lessons in design to 
improve upon default output from a proprietary tool and create a more desirable 
experience for our audience. See Figure 5.2a, which shows the number of cars 
sold by dealership over time for a given region. 



FIGURE 5.2a Default output from tool 


STEP 1: First, let's simply react to this graph. What words come to mind in terms of 
how this graph makes you feel? Make a short list of the feelings this graph evokes.

STEP 2: What changes would you make if you needed to communicate the data 
from this graph? Specifically, address: 

• Use of words: As we've discussed, words make our data interpretable. We 
should consider not only what words we use to do this, but also where we put 
those words. How and why would you make changes to the titles or placement
of titles in this visual? Are there other ways you can improve upon the
way words are used in this example? 

• Visual hierarchy: We've learned it can be helpful to highlight sparingly and
push non-critical or non-message-impacting elements to the background. 
How might you do that here? Which pieces of information or aspects of the 
design would you focus on and which would you de-emphasize or eliminate? 

• Overall design: Are there any elements of the design you find distracting 
currently? How could you more effectively use alignment and white space? 
What changes would you recommend making to the overall design of this 
information? 


STEP 3: Download the data and graph. Remake the visual applying the changes 
you've outlined in the tool of your choice. 

STEP 4: Imagine you have been asked to create a single slide focusing on this 
data that will fit into a broader deck to be shared with the management team who 
oversees these dealerships. How would that affect what you show or how you 
choose to show it? What additional words can you put around it to help it make 
sense? What other design considerations would you make? Create this slide in 
the tool of your choice. 

Solution 5.2: do it better!

STEP 1: My initial response to this graph brings to mind words like: confusing, 
chaotic, overwhelming, and complicated. These are reactions I'd like to avoid 
when I communicate with data! 

STEP 2: The following describes how I would approach remaking this graph to 
both better get the information across and foster a more pleasant overall experience
for my audience.

Use of words: I like the fact that everything is titled in the original, but I'm not a 
fan of the center alignment of the graph and axis titles. I would upper-left-most 
justify all of the titles so that when my audience starts at the top left, they encounter
how to read the visual before they get to the data. I'll choose all caps for my
y-axis title because of the nice rectangular framing that this, together with the 
graph title, creates for my graph. On the x-axis, we probably don't need the title 
of Quarter, as this is quite obvious from the individual labels. I'll omit this. There 
is currently a lot of redundancy with the x-axis labels given the repeated years, so 
I'll pull those out as super-category axis labels. 

When it comes to creating visual hierarchy, I have to decide what to focus on in 
this graph. In the original, it's difficult to focus on anything because so much is 
competing for our attention. I see that Regional Avg is emphasized in the original 
via a thicker black line (though this doesn't stand out nearly as much as it could 
given all the other lines, colors, and shapes). I'm going to push everything else 
to the background. When it comes to eliminating distractions, I'll also remove 
the grey background, borders, and gridlines. Getting rid of these non-information-bearing
elements will help my data stand out more and make for a less cluttered-feeling
visual overall.

In terms of additional changes I would make to the overall design, it currently 
takes work going back and forth between the alphabetical legend at the right and 
the data it describes. I'd like to eliminate this work for my audience. My typical 
method for resolving this is to label the lines directly. This is challenging here 
because many of the lines are close together, but I'm still going to try it and get a 
little creative in the process. This won't be the best view to show what is going on 
for a given dealership (unless I put them into different graphs or emphasize only 
one or a couple at a time), but I can still give a sense of the highest, the lowest, 
and which fall generally in the middle when it comes to the most recent data by 
labeling in groups on the right-hand side of the graph. 
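
The super-category idea for the x-axis (show each quarter once and each year once, rather than repeating the year in every label) can be sketched in code. This Python example is an illustration under assumptions: it presumes the original tick labels look like "Q1 2017" and span Q1 2017 through Q3 2019, and the function name split_labels is hypothetical.

```python
from itertools import groupby

# Assumed label format and range; adjust to match your own axis labels.
labels = [f"Q{q} {year}" for year in (2017, 2018, 2019) for q in (1, 2, 3, 4)][:11]


def split_labels(labels):
    """Split 'Q1 2017'-style labels into tick labels and year groups.

    Returns (ticks, groups) where each group is
    (year, index of its first tick, number of ticks it spans).
    """
    ticks = [label.split()[0] for label in labels]
    years = [label.split()[1] for label in labels]
    groups, start = [], 0
    for year, run in groupby(years):
        count = len(list(run))
        groups.append((year, start, count))
        start += count
    return ticks, groups


ticks, groups = split_labels(labels)
print(ticks)
print(groups)
```

Each group's start index and span tell you where to center the year beneath its quarters, so the year appears once per group instead of eleven times across the axis.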

STEP 3: Figure 5.2b shows my visual with these changes incorporated.


Car sales over time 



FIGURE 5.2b Remade visual 


With Figure 5.2b, my audience can easily focus on the Regional Avg and also get 
a sense of the range and distribution over time across dealerships. If it's important 
to have a more specific understanding of what's happening for a given retailer, 
however, that's more difficult. If I need to solve for that as well, rather than try to 
do more with this graph, I might augment it with another view of the data. We'll 
look at that momentarily. 

STEP 4: If I'm given a single slide to use as my communication vehicle, I'd want 
to put more words around everything to make sure it makes sense and attempt to 
answer the question of "So what?" I'd use titling and text together with my visuals
(being conscious of white space and alignment) to create clear structure on
the page. I'd also emphasize sparingly, both to create visual hierarchy and make 
the information scannable. This would help tie related elements together, easing 
the processing for my audience. See Figure 5.2c. 

[Slide: Regional car sales: mixed results]

Left side: OVERALL DECLINE IN REGIONAL AVERAGE

The total number of cars sold across all dealerships (not shown) has decreased
over time from more than 1,000 in Q1 2017 to 857 in Q3 2019 (a 17% reduction).
The average number of cars sold by dealership has also decreased over time.

[Line graph: Car sales over time. Horizontal axis: Q1 2017 through Q3 2019]

Right side: MARKED VARIANCE BY DEALERSHIP

In the latest quarter, Lakeside, Draper, and Filmore had the most cars sold
(105, 103, and 88, respectively), while Oakley, Pierce, and Mare Valley had the
fewest (less than 40 cars sold each).

[Horizontal bar chart: Car sales by dealership: Q3 2019. Axes: DEALERSHIP | # OF CARS SOLD.
Bars from top to bottom: Lakeside, Draper, Filmore, Wildland, Sealy, North, Orly,
Regional Avg, South Lake, Beacon, Rosedale, Westlake, Oakley, Pierce, Mare Valley]

Data source: Sales Database, includes cars sold onsite at regional dealerships through 9/30/19.

FIGURE 5.2c Presenting on a single slide 

In Figure 5.2c, I added a second graph—horizontal bars showing how car sales 
across the various dealerships compare for the most recent time period. I'm making
the assumption that this is the most relevant and that we don't necessarily
need the full view over time for each (we can see the highs and lows with relative 
ease on the left, but if it's important to be able to distinguish the middle ones, that 
becomes impossible given the current design). 

I've added more text around the graphs, both clear concise titling and descriptive 
text to help make what I'd like to highlight to my audience clear. I've used white 
space and alignment to create a two-sided layout. If we step back and consider 
how our audience is likely to process this information, they will probably start 
at the top left, read the slide title, then move downward and read "Overall decline
in regional average" and see the black line below in the graph that depicts
this. Then they'd typically move to the right-hand side, perhaps pausing on the 
"Marked variance by dealership" title or the blue and orange text. Finally, they'd 
look down to the right graph, see the black average tied to the graph on the left, 
as well as the blue and orange bars that are connected through similarity of color 
to the above words. 

I did try out a second iteration of the left graph in Figure 5.2c that maintained consistent
coloring of blue and orange for the top and bottom three dealerships as of
Q3 2019 (those called out on the right). While I liked the consistency, I felt these
competed too much for attention with the Regional Avg on the left, so I decided
to use these colors sparingly on the right graph only. 

The primary point here is to be thoughtful in the overall structure and design of 
your visuals and the pages that contain them. Don't simply rely on tool defaults; 
once you make a graph, there is still more work to be done. When we design 
thoughtfully, we can create a better experience for our audience, improving the 
odds of successful communication. 

Exercise 5.3: pay attention to detail & design intuitively

The following example employs a two-sided structure similar to where we ended 
in Exercise 5.2. However, clear structure is not the only thing we need for success. 
Attention to detail is a hugely important aspect of creating effective visual design. 
Let's look at another example and how attention to detail and thoughtful design 
choices can improve our visual communications. 

Let's assume you work for an on-demand print company that targets small businesses.
One of the metrics you track is customer touchpoints (how many times
someone at your organization interacts directly with a customer), both in aggregate
and on a per-customer basis. There are three primary modes of connection:
phone, chat, and email. 

Your colleague has put together the following slide summarizing touchpoints over 
time and asked for your feedback. Spend a moment examining Figure 5.3a, then 
tackle the following. 

[Slide content of Figure 5.3a]

Slide title: Total touchpoints and touchpoint per customer remains flat
Left side: Total touchpoints have increased slightly to ~500K (+3.8% y/y)
(stacked bar graph; legend: Phone Touchpoints, Chat Touchpoints, Email Touchpoints)
Right side: Touchpoints per customer remain flat over the past 3 years (table)

FIGURE 5.3a Your colleague's original slide

STEP 1: What feedback would you give your colleague about the design of their
slide related to attention to detail? Write down your thoughts. Focus on not only 
what you would recommend changing, but also why. Ground your feedback using 
design principles we have discussed. 

STEP 2: Take a step back and think about how the data is designed: stacked 
bars on the left, table on the right, and additional numbers in the text. Are there 
changes you would make to the way this data is shown? How might you design 
the data in a way that is more intuitive for our audience? Write down your ideas. 

STEP 3: Download the data and original visuals. Remake the slide, incorporating 
your feedback and ideas in the tool of your choice. 

Solution 5.3: pay attention to detail & design intuitively

STEP 1: First, let me say that attention to detail is hugely important in our visual
designs. Typically, the graph or the slide is the only part of the analytical process
that our audience actually sees. Whether they should or not, people tend to make
assumptions about the overall level of care that went into the work based on this piece that
they can directly observe. So make your visuals and the pages that contain them
imply good things about your overall work! 

Related to attention to detail, I would concentrate my feedback on three areas: 
consistency, alignment, and intuitive axis labels. Let's review each of these. 

Consistency is an important aspect when it comes to attention to detail: be consistent
in your approach unless it makes sense for some reason not to be. Changing
design elements up randomly or otherwise introducing unnecessary inconsistency
can be attention grabbing and distracting, and it looks sloppy. Specific things that catch
my eye in this case are the inconsistent decimal places in the y-axis labels of the graph and
in the bottom Email cell of the table. Also, the way the dates are shown is inconsistent
between the graph and the table, and not even consistent within the table!

When it comes to alignment, as we've discussed, centered text often looks messy. 
When it flows onto multiple lines, it creates jagged edges, as we see in the center-aligned
statements above the graph and table. While I might preserve the centering
of numbers in the table (if I were to keep the table; more on that shortly), I
would be consistent in the vertical alignment. I would center consistently in that
direction as well (currently the dates in the table are top aligned, while the numbers
are center aligned vertically). Also, the overall elements on the page could be
aligned a little better: the table isn't directly under the line above it and the orange
box on the far right could be sized to better fit the cells it's meant to highlight. 

My final main point of feedback on the current design would be with regard to
intuitive axis labels in the graph. Currently, every fifth month is labeled on the
x-axis. We can see why this was done: there isn't sufficient space to label every 
point, particularly given the long format of the dates. One method is to label 
only some, though we should be thoughtful with what frequency we choose to 
label. Choose a frequency that will be intuitive based on the data being shown. 
For example, every seventh point labeled would make sense for daily data (since 
there are seven days in a week) or it could make sense to label by weeks instead of 
days. For monthly data, every third or sixth month would be more intuitive. If you 
have limited space with time on the x-axis, you could label by quarters or years. 
We could pull the years out as a supercategory and either abbreviate the months 
and arrange the text vertically, or just use the first letter of each month to maintain 
horizontal text. I'll employ this latter method in my solution. There isn't a single or 
preferred approach: choose axis labels that will be intuitive, helpful, and legible 
for your audience. 
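
The labeling approach described above (month initials with the year pulled out as a supercategory) can be sketched in Python. This is only an illustration under assumptions: the helper name `month_axis_labels` is hypothetical, and the sketch is not the method used to build the figures in this solution.

```python
MONTH_INITIALS = "JFMAMJJASOND"  # first letter of each month, January to December

def month_axis_labels(start_year, start_month, n_months):
    """Build x-axis labels for n_months consecutive months.

    Returns (initials, year_groups):
      initials    - one letter per month, e.g. ["J", "F", "M", ...]
      year_groups - (year, first_index, last_index) tuples telling which
                    tick positions each year supercategory label spans
    """
    initials = []
    year_groups = []
    year, month = start_year, start_month
    for i in range(n_months):
        initials.append(MONTH_INITIALS[month - 1])
        if year_groups and year_groups[-1][0] == year:
            # Same year as the previous tick: extend the current group.
            group_year, first, _ = year_groups[-1]
            year_groups[-1] = (group_year, first, i)
        else:
            # First tick of a new year: start a new group.
            year_groups.append((year, i, i))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return initials, year_groups
```

For instance, `month_axis_labels(2018, 1, 25)` would return one initial for each month from January 2018 through January 2020, plus the tick ranges under which the year labels 2018, 2019, and 2020 would be centered.
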

As additional points of feedback, I'd reduce redundancy by removing "Touchpoints"
from each category label in the graph and also label the data directly
so my audience doesn't have to go back and forth between the legend and the 
graph to decipher the data. Color is also clearly something we can play with here, 
but I'll reserve that for when I consider the overall design momentarily. 

Figure 5.3b illustrates what my remake of the graph would look like incorporating 
the changes I've outlined. 

FIGURE 5.3b Redesigned graph with greater attention to detail


STEP 2: When it comes to stepping back and designing the data in a way that
makes sense, there are more sweeping changes I would recommend. Let's walk
through them.


Going back to Figure 5.3a, there are a lot of numbers between those called out 
in the titles, those added to the graph, and those in the table. We don't need all 
of these. Let's talk first about total number of touchpoints. This is referred to in 
the title and through text and numbers that have been added to the graph. If this 
information is critical, I could break it out on a separate slide and graph it (and 
would probably include more data than simply the two yearly numbers that are 
mentioned currently). Otherwise, I'd be apt to include the additional context as a 
sentence rather than clutter my graph with it. 

Turning our attention to the table: this doesn't add any new information. The data
shown there is already graphed in the January points in the graph on the left. So 
rather than break it out separately, if these specific numbers are of interest, I'd 
recommend putting them on the graph directly with the data. In this case, I don't 
think these numbers are critical. If we step back and think about the story, that will 
lead us to look at different views of the data, both to get a better understanding 
of where we want to focus and the story we can tell, as well as to figure out how 
to make that clear and easy for our audience. 

Let's focus on other ways we could visualize the data. One challenge with stacked 
bars is that we can really only compare the first data series at the bottom of the 
stack and the total (overall height of bars) with ease. If anything interesting is 
happening in a data series up the stack, it becomes quite difficult to see because 
those pieces are stacked on top of other pieces that are also changing. To allow 
for both of these comparisons with greater ease, I could unstack the bars and turn
them into lines: see Figure 5.3c. 

FIGURE 5.3c Graph the data as lines

In Figure 5.3c, I unstacked the categories and graphed each type of touchpoint— 
Email, Phone, Chat—as lines. I added an additional line representing the Total. 

I also stripped color out of the graph entirely, so we can look at all of the data 
critically and determine where it might make sense to focus. We'll add some color 
back in a later step. 
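
Unstacking the bars means each category becomes its own series, and the Total has to be computed explicitly because it is no longer implied by the height of a stack. A minimal sketch of that data preparation follows. The function name `with_total` and the numbers are hypothetical (they are not the touchpoint data from the exercise), and integers are used only to keep the arithmetic easy to check.

```python
def with_total(series_by_category):
    """Return a copy of the input with an added "Total" series.

    Each value is a list of per-period numbers. All lists must be the
    same length so they can be summed period by period.
    """
    lengths = {len(values) for values in series_by_category.values()}
    if len(lengths) != 1:
        raise ValueError("all series must cover the same periods")
    total = [sum(period) for period in zip(*series_by_category.values())]
    return {**series_by_category, "Total": total}

# Hypothetical monthly values for three periods (illustration only).
touchpoints = {
    "Email": [50, 40, 40],
    "Phone": [30, 30, 20],
    "Chat": [10, 10, 20],
}
lines = with_total(touchpoints)  # four series, each drawn as its own line
```

Each entry in `lines` would then be plotted as a separate line against the same x-axis, initially all in the same neutral color.
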

When I look at this data, what jumps out at me—even more than with the stacked 
bars—is the apparent seasonality. When we want to clearly see seasonality (or in 
some cases, a lack of seasonality), it can work well to use a single year of months
(for example, from January to December) for our x-axis, with a different line for
each year. This change will result in a lot of lines if we do it for every category. With
different data, we may need to split it into multiple graphs. However, here, given
the spread of the data, we can make it work in a single graph. See Figure 5.3d.

[Graph title: Touchpoints per customer over time]

FIGURE 5.3d Change x-axis to monthly calendar year to better see seasonality

In Figure 5.3d, I've changed the x-axis to January through December, plotting each 
year as its own line. Within each color grouping, the thin line represents 2018, the 
thick line represents 2019, and the circle points at the left represent our single 
month of data—January—for 2020. Notice that we see pretty consistent seasonality 
in Total touchpoints, with higher touchpoints per customer in January and December
and relatively lower through the rest of the year. Don't worry if you aren't loving
this graph—it's an interim step to help get us to where we're going next. 
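
The reshaping behind this view (one January-to-December x-axis, one line per year) amounts to grouping monthly observations by year. A minimal sketch, assuming the observations arrive as `(year, month, value)` tuples; the function name `lines_by_year` and that input shape are illustrative assumptions, not something specified in the exercise data.

```python
from collections import defaultdict

def lines_by_year(observations):
    """Group (year, month, value) observations into one line per year.

    Returns {year: [jan_value, ..., dec_value]} with None for months
    that have no data, so every year shares the same 12-point x-axis.
    """
    lines = defaultdict(lambda: [None] * 12)
    for year, month, value in observations:
        lines[year][month - 1] = value
    return dict(lines)
```

A year with a single observation, such as the lone January 2020 point described above, comes back as one value followed by eleven `None` entries, which is why it shows up as an isolated marker instead of a line.
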

I'm going to assume that we're standing in February 2020, since the most recent 
data point is January 2020. Given this, plus the shape of the data over the course 
of the year (higher at beginning and end, as mentioned, and lower in the middle),
I am going to adjust my x-axis. Rather than the typical calendar year (January to
December), I will change it to go from July to June to make it easier to see how recent
months have compared year-over-year. In doing this, I'll also eliminate some
data, solve for the awkward single data points in 2020, and simplify my lines to 
"This Year" and "Last Year." See Figure 5.3e. 
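
The July-to-June re-windowing can be sketched as follows. This is a hedged illustration: the function name `fiscal_lines`, the `(year, month)` dictionary keys, and the `start_month` parameter are assumptions made for the example, not part of the original workbook.

```python
def fiscal_lines(values, latest_year, latest_month, start_month=7):
    """Split monthly values into "This Year" and "Last Year" lines.

    values maps (year, month) -> number. The x-axis covers 12 months
    beginning at start_month (July by default). Months without data
    are returned as None, and older data falls outside both windows.
    """
    # Calendar year in which the current 12-month window began.
    if latest_month >= start_month:
        window_start_year = latest_year
    else:
        window_start_year = latest_year - 1

    def window(first_year):
        out = []
        for offset in range(12):
            month_index = start_month - 1 + offset  # 0-based, may pass 11
            year = first_year + month_index // 12
            month = month_index % 12 + 1
            out.append(values.get((year, month)))
        return out

    return {
        "This Year": window(window_start_year),
        "Last Year": window(window_start_year - 1),
    }
```

With a latest data point of January 2020, "This Year" would hold July 2019 through January 2020 followed by empty months, and "Last Year" would hold July 2018 through June 2019. Anything earlier than July 2018 drops out, which matches the note above about eliminating some data.
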

[Graph title: Touchpoints per customer over time]

FIGURE 5.3e Change x-axis to run from July to June


With this view, I can make a couple of observations that didn't jump out at me 
before. First, let's pause on the Total: we see this year's trend has followed last 
year's closely. However, January touchpoints per customer are lower than last year. 
Moving downward, we see both Email and Phone touchpoints are trending lower 
this year compared to last year. Chat touchpoints, on the other hand, illustrate 
something different: Chat touchpoints have been consistently higher this year 
compared to last, with that difference increasing in January. 

You may notice the varying decimal places on the labels in Figure 5.3e. I chose to 
round to one point past the decimal for Total and Email given the magnitude of 
the numbers. I took it out to an additional place past the decimal for Phone and 
Chat, both so that we can evaluate the small but potentially meaningful difference 
and so two points of varying heights wouldn't be labeled with the same number 
(in this case 0.3), which could cause confusion. 
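
One way to automate this kind of rounding decision is to add decimal places only until unequal values stop sharing a label. The sketch below is an assumption-laden illustration, not the procedure used for Figure 5.3e: the function name `distinct_labels` and its parameters are hypothetical, and it applies one precision to a whole group of values.

```python
def distinct_labels(values, min_decimals=1, max_decimals=4):
    """Format values with the fewest decimals that keep unequal values distinct."""
    labels = []
    for decimals in range(min_decimals, max_decimals + 1):
        labels = [f"{value:.{decimals}f}" for value in values]
        # Stop once different values no longer collapse to the same label.
        if len(set(labels)) == len(set(values)):
            break
    return labels
```

For example, values of 0.34 and 0.26 both round to 0.3 at one decimal place, so the function would move on to two decimals for that pair, while 0.50 and 0.58 are already distinguishable at one decimal.
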

STEP 3: Pulling this all together and putting words back around it, my final slide
might look something like Figure 5.3f. 

Total touchpoints flat, shift toward chat 

There is clear seasonality to customer touchpoints, which peak in January. 

While email and phone are down year-over-year, chat touchpoints have increased. 

LET’S DISCUSS: How should this inform go-forward strategy and goals? 


OVERALL recent months have followed 
last year's trend closely, with slightly lower 
touchpoints per customer as of Jan. 

EMAIL continues to make up the highest 
volume of touchpoints, though as of Jan is 
slightly lower than last year (0.50 vs. 0.58). 

PHONE at 0.34 touchpoints per customer 
also decreased year-over-year (0.45 at 
same time last year). 

CHAT touchpoints have increased steadily 
in recent months. While only 0.26, this 
accounts for an increasing proportion of 
total and reflects nearly doubling year- 
over-year. Add more context here: whether 
this is desired, expected to continue, etc. 

FIGURE 5.3f My redesigned slide 

If I were talking through this information in a live setting, my slides would focus 
on the graph and I would build it piece by piece (we’ll look at examples of this in 
Chapters 6 and 7). However, if I have a single slide to get the information across
(perhaps this is a slide that's being incorporated into a broader deck that will be
sent around), then I want to put all of the words around it so it makes sense. The
words I've added are mostly descriptive; ideally we'd use this annotation to lend
additional context, provide framing of whether what we are seeing is good, expected,
and so on. I tied words to the data they describe through similarity of color.
The result: when my audience reads the words, they know where to look for evidence
in the data and vice versa. I used sparing color, relative size, and position
on the page to create visual hierarchy and help make the information scannable. 


By being thoughtful in all aspects of our design, we can make our data more easily
consumable for our audience, helping ensure that our message comes across clearly. 

Exercise 5.4: design in style

Something we haven't touched upon yet that can influence our design style when 
communicating with data is brand. Companies often go through great amounts 
of time and expense to create their branding: logos, colors, fonts, templates, and 
related style guidelines. Beyond being required to use this, there can be value in 
rolling branding into how you visualize data: it helps create a cohesive look and 
feel and can even add some personality into your data communications. Let's 
practice applying branding to a graph! 

We originally looked at the following graph in Exercise 3.1. Figure 5.4a shows 
market size over time for a given product. The storytelling with data typical look 
and feel has been applied. The font is Arial. Titles have been justified at upper left. 
Axis titles are in all caps. Most elements are in grey except sparing use of color to 
direct attention (orange for a negative callout and associated data point, brand 
blue for positive data point and corresponding comment). 

[Graph content of Figure 5.4a: line graph of market size over time, monthly
from January 2018 through December 2019, with the 2019 forecast marked;
labeled values include $1.6B and $2.4B]

Annotation (2019): The year started at less than $1.6B, but increased markedly
in February, when a new study was released. Total sales have increased steadily
since then and this is projected to continue. The latest forecast is for $2.4B
in monthly sales by the end of the year.


FIGURE 5.4a Graph with storytelling with data branding 

Download the data and graph then complete the following. 


STEP 1: Imagine you work for a brand similar to United Airlines and need to pull 
together an annual report that involves looking at market size. Start by doing 
some research: visit United's website, search Google images, and browse related 
pics. Write down 10 adjectives that describe the brand. Recreate Figure 5.4a, 
rebranding with a style similar to United Airlines. Reflect on how this affects your 
choice of colors and font. How else might this brand influence changes in the
design of this graph? 

STEP 2: Let's do this a second time. In this instance, you are an analyst at Coca 
Cola. Repeat the exercise, first by doing some research and making a list of words 
or feelings you'd associate with the brand. Then recreate this graph again, rebranding
based on your research. What changes did you make to achieve this?
How does red as a brand color play into your design? 

Solution 5.4: design in style

STEP 1: Words that come to mind when I look at the United Airlines website and 
search Google for related images include: clean, classic, bold, blue, navigable, 
open, minimal, simple, serious, and structured. The logo has an intense dark blue 
background, with center-aligned, bold, white, capital letter text and sparing use 
of a lighter, more muted shade of blue. I can incorporate these feelings and elements
into my design of the graph. See Figure 5.4b.

[Graph content of Figure 5.4b]

2018: Jan-Jun was a period of stability, with fairly steady growth (averaging
+3% per month). There was a nearly 20% decrease in July, when Product X was
recalled and pulled from the market. Total sales remained at reduced volume
for the rest of the year.

2019: The year started at less than $1.6B, but increased markedly in February,
when a new study was released. Total sales have increased steadily since then
and this is projected to continue. The latest forecast is for $2.4B in monthly
sales by the end of the year.

Footnote: 2019 forecast provided by ABC consultants and based on market data
through June. The forecast assumes no major market changes.


FIGURE 5.4b Branding inspired by United Airlines 

My main initial changes were to color and font. I used the dark and light blues 
throughout, with the exception of the graph axes: choosing black for axis titles 
and labels and grey for axis lines. The font I chose (Gill Sans) takes up a bit more 
space than Arial. This looked overly crowded with the text boxes above the data 
line. To remedy this, I moved the text boxes below the data and also reduced the 
y-axis maximum to shift the line upward, creating room below it to reposition the 
text boxes. I positioned the footnote below the graph. 


I center-aligned most of the text (I played with left and right alignment of the 
large text boxes, and while I liked the structure of the clean edge that created, 
something about it didn't feel fitting with the rest of the graph). The United logo 
and brand connote a feeling of clear organization to me, so I manifested that here 
by adding blue rectangles behind the title and footnote and also a blue border 
around the graph. I thickened the data line because I like how this balances out
the bold title text. Even though the primary brand color is blue (similar to SWD), 
this rebranded graph feels quite different than the original Figure 5.4a as a result 
of these changes. 


STEP 2: Next, let's be inspired by the Coca Cola brand. I reviewed can and bottle
labels, logos, and advertisements. Words I would associate with this brand
include: red, silver, round, classic, bold, sweet, playful, international, diverse, and 
wet (there's often condensation shown on the cans!). I observe a heavy use of red 
backgrounds, contrasting white text and sparing use of black. Text is typically 
center-aligned and frequently features a combination of bold all caps surrounded 
by slightly smaller non-bold all cap text. Words are used minimally. I'll fold these 
components into my redesign. See Figure 5.4c. 

FIGURE 5.4c Branding inspired by Coca Cola

One aspect of the Coca Cola brand that I chose not to incorporate is the cursive-like
text in the Coca Cola logo. While this is fine for a logo, my priority for text
related to the graph is legibility. 

Text should be large enough to read and in a font that is easy to read. I opted 
for a sans serif font similar to the supporting text I saw on can and bottle labels 
(Montserrat, a free font that I downloaded). To incorporate some of the round 
feel that you get from the logo, I opted for a rounded (rather than rectangular) 
background shape. 

Speaking of the background, the red background in Figure 5.4c is quite bold. This
might be fine if it is the only graph we are looking at, or if graphs will be projected
one by one on slides. If there will be multiple graphs on a single page or if I anticipate
that my audience will want to print it, I may opt for a lighter "Diet Coke"
version. See Figure 5.4d. 

FIGURE 5.4d Less ink-heavy background

In Figure 5.4d, I opted for a light grey background, similar to the silver I saw incorporated
into some of Coca Cola's designs. With this lighter background, black
stands out more, so I opted for a few more black elements compared to the 
original remake. I can use white, which fades to the background on grey (whereas 
it stood out a lot against red) for elements such as axis lines. I limited my use of 
brand red to the graph title and data. 

Red as a brand color works well with grey and sparing use of black, and looks 
quite slick as we see in Figure 5.4d. When it comes to colors, there is a tendency 
to use red and green to denote bad and good or negative and positive, respectively.
While I recommend against this due to considerations for colorblindness, I
especially discourage it for organizations having red as a brand color. You want
positive things associated with your brand, so if your brand color is red, don't 
associate red with negative or bad things. One alternative in this circumstance 
can be to use red for good and black for bad. In the preceding graph, I've used 
red for general data and black for call outs (without connotation of bad or good), 
which is another option. 

Stepping back and summing up: there can be value from rolling branding into
how you communicate with data. If you work with client organizations, consider 
how you can undertake research similar to what we've done here and integrate 
your learnings into your designs. When it comes to your own organization's brand,
many companies have style guides that you can use to better understand the 
brand and what options you may have. Regard these not as annoying constraints, 
but rather as a lodestar that can inspire creativity and cohesiveness across your 
data communications. 

Exercise 5.5: examine & emulate


One piece of advice I often give is to simply observe the examples of data visualization
you encounter in the world around you. Pause to reflect: for the good
ones, what works well that you can emulate in your own work? For the not-so- 
good ones, identify what pitfalls the creator fell into that you can avoid. Let's do 
an exercise when it comes to the effective side of things. 

Rather than simply pause and figure out what works well, we can go a step further 
and take the time to emulate the effective examples we identify, recreating them 
and learning how to achieve the aspects of effective designs in our tools. The level
of attention to detail this process forces can help us be more thoughtful in our
own work and sharpen our visual design skills and style. Let's practice all of this! 

First, identify a visual (graph or slide) someone else created that you believe is effective.
This could be an example from a colleague at work, the media,
storytellingwithdata.com, or elsewhere. After you've chosen an example, tackle the following.

STEP 1: Consider the four aspects of design we've discussed: (1) affordances, (2) 
aesthetics, (3) accessibility, and (4) acceptance. Judging from the visual you've 
chosen and making assumptions as needed for the purpose of the exercise—how 
did the creator account for each of these areas through the choices they made in 
their design? Write a few sentences describing how each of these four aspects of 
design were achieved. 

STEP 2: Stepping back, why is it that the example you've chosen is effective? Are 
there specific elements of thoughtful design that make it work that you haven't already
described? How might you generally apply these learnings to your own work?

STEP 3: Is there anything about the example you've chosen that you believe is 
not ideal or that you would have done differently? Write a couple of sentences 
outlining your thoughts. 

STEP 4: Recreate the visual you've identified in the tool of your choice. First, work 
to emulate it as closely as you can when it comes to the specifics (typography, 
color, and overall style). 

STEP 5: Make another version that incorporates any of the aspects you outlined
in Step 3 that you would have approached differently. Look at your visuals from 
Step 4 and Step 5 side by side. Which do you prefer and why? 


Exercise 5.6: make minor changes for major impact 

It's frequently a lot of little things that work together to create a great or not-so- 
great experience for our audience in the data communications we design. This 
means that small changes can have big impact in improving our visual designs. 
Let's look at an example and also practice how these modifications can add up to 
help us take work from acceptable to exceptional. 

Let's say you work at an advertising agency and have been asked to assess a 
recent six-week ad campaign for a client. The data you are focusing on is incremental
reach, which you measure "per 1,000 impressions." You have a colleague
who did a similar analysis for a different client recently, so rather than start from 
scratch, you've updated her visuals with your data as a starting point. Next, you 
want to edit and refine. 


Figure 5.6 shows the visual you've created. Spend a couple of minutes to familiarize
yourself with the details, then complete the following.

[Slide content of Figure 5.6]

Slide title: Incremental Reach per 1,000 Impressions
Subtitle: Digital platforms proved successful at reaching new viewers later in the campaign that were not exposed to TV ads.

FIGURE 5.6 Your original slide

STEP 1: Pause first to consider what is working well. What do you like about the
current view of the data? 

STEP 2: A number of steps have been taken in Figure 5.6 to direct attention and 
help explain. Which are working well? Where and how might you adjust? 

STEP 3: What clutter would you eliminate? What elements would you push to the 
background? 

STEP 4: What other design choices made here do you question given the lessons 
in this chapter? What additional changes would you make? 

STEP 5: Download the data and current graphs. Refine the visual by making the 
changes you've outlined in the steps above using the tool of your choice. 


Exercise 5.7: how could we improve this? 

Imagine you work for the same on-demand printing company that we assumed 
in Exercise 5.3 when we looked at customer touchpoints data. How your company
interacts with customers is one possibly interesting topic, as we saw. Another
might be the competitive landscape for your products. As part of this latter area
of focus, your colleague has been asked to pull together some data on your main
competitors' market share over time.

He comes to you with his slide—Figure 5.7—and asks for feedback. 

Study Figure 5.7, then complete the following. 

[Slide content of Figure 5.7]

Slide title: Top competitors remain present, with an increase in use of XBX Business
(graph with a percentage axis; legend: 2016, 2017, 2018, 2019)

• Reduction in in-house solutions continues: likely because of the shifting
mix of our customer base
• PrintPresse gains importance: it's becoming a bigger threat due to the
nature of its offering (integrated, high quality)
• XBX's business-centric offering is gaining ground, while Pring4Cheap and
CustomPrint show a decrease. At the same time, XBX has seen slippage in
their standard offering.

FIGURE 5.7 How could we improve this? 


STEP 1: List 5 design improvements you would recommend making to this slide. 
Articulate not only what, but also why. How specifically will your ideas improve 
the design? 


STEP 2: Download the data and execute the changes you've outlined in the tool 
of your choice. 


STEP 3: Consider how you would present this material in a live meeting compared
to something that has to be sent around as a stand-alone document. How
would your approach change between these two instances? Write a few sentences
to explain.


Exercise 5.8: brand this! 

As we explored in Exercise 5.4, there are ways that we can incorporate company
or personal brand into how we communicate with data. This can be facilitated 
through choice of font, color, and other elements. In some cases, it may mean 
incorporating a logo or using a customized slide or graph template. Let's practice 
how you can incorporate branding in a graph. 

Suppose you work for a pet food manufacturing company. Look at the following 
graph, Figure 5.8, which depicts relative cat food sales over time (expressed in 
terms of % of total) for a given brand line, Lifestyle. Complete the following. 

[Graph content of Figure 5.8]

Graph title: Lifestyle brand sales: Natural making up increasing proportion
Series: Lifestyle, Diet Lifestyle, Lifestyle Plus, Lifestyle Natural

FIGURE 5.8 Brand this!

STEP 1: Identify two recognizable brands. They don't have to be at all relevant
to this example—these could be company brands or sports teams, for instance. 
It will be more fun and a better exercise if you pick two that are quite different 
from each other in terms of style. Research images related to the brand and list 10 
adjectives that describe the look and feel of each. Remake this visual two times, 
incorporating branding components of each of these, respectively. 

STEP 2: Take a step back and compare the two visuals you've created. How does 
each feel? Were you successful bringing to life the adjectives you outlined in Step
1? How can branding affect how we communicate with data generally? What are 
some pros and cons of this? Write a few sentences with your thoughts. 

STEP 3: Consider your company or school's brand. What descriptors would you 
associate with it? Remake the graph again, styling it accordingly. To take it a step 
further, integrate your branded graph into a slide, applying consistent branding 
to any elements you add (title, text, logos, and colors). 

STEP 4: How would you generalize the components of brand we should think 
about when we visualize and communicate with data? What are the benefits of 
doing so? Are there scenarios where we may not want to be consistent with brand 
in our data communications? Write a few sentences outlining your thoughts. 

Exercise 5.9: make data accessible with words

When you look at a graph you made, it's likely you know what you're looking at: 
what to pay attention to, how to interpret it, and what to take away. But as we've 
discussed, this isn't necessarily clear to our audience in the same way. Words used 
well can be a strategic tool for making our data comprehensible for our audience, 
answering questions before they arise, and helping them to draw the same conclusion
that you have.


[Figure 5.9: annotated slide with the takeaway title "Words make data accessible!",
the subtitle "Recommendation: use text to highlight key points," placeholder
"Graph title" and "AXIS TITLE" labels, and the footnote "Any time you show data,
you should have a footnote with the data source, as of date, and necessary
assumptions and/or methodology (given your situation and audience)."]

FIGURE 5.9 Use words wisely


There are some words that have to be present: every graph needs a title and 
every axis needs a title. Exceptions to this will be rare (for example, if your x-axis 
reflects months, you probably don't need to title it "months of the year"—you do, 
however, need to make it clear what year it is!). Make it your default to title axes 
directly so your audience doesn't have to guess or make assumptions about that
at which they are looking. Also don't assume that people looking at the same data 
are going to walk away with the same conclusion. If there is a conclusion you want 
your audience to draw—which there should be when using data for explanatory 
purposes—state that in words. Use what we know about preattentive attributes to 
make those words stand out: make them big, make them bold, and put them in 
high priority places such as the top of the page. 

Speaking of which, the top of the page (in Figure 5.9, "Words make data
accessible!") is precious real estate. It's the first thing your audience encounters
when they see your page or screen. Too often, we use this precious real estate for 
descriptive titles. Instead, use this for an active title; put your key takeaway there 
so your audience doesn't miss it. This also works to set up what will follow on the 
rest of the page. (We'll further explore and practice takeaway titling in Chapter 6.) 

Also consider what is helpful to have present but doesn't necessarily need to draw 
attention. For example, when showing data, it is often useful to have a footnote 
that lists details such as the data source, the time period represented (or time at 
which the data was extracted), assumptions, or methodology details. These are 
things that can help your audience interpret the data and lend credibility, as well 
as give you a reference in the event you need to replicate and create something 
similar in the future. It's important, but doesn't need to compete with other things 
for attention. This text can be smaller, grey, and in lower-priority places on the 
page, like the bottom. 

After you've created your graph or slide, run through the following questions to 
help ensure you are using words wisely: 

• What is the key takeaway? Have you stated it in words prominently so your 
audience doesn't miss it? 

• Does your graph have a title? Is it descriptive enough to set the right expectation
for your audience when looking at the data?

• Are all axes labeled and titled directly? If not, what steps have you taken to 
make it clear to your audience? 

• Do you have a footnote listing details that are important, but don't need to 
take main stage? If not, should you? 

• Stepping back: does this seem like an appropriate amount of words given 
how you'll be communicating to your audience? Typically, you'll have fewer 
words on a slide for something you'll be presenting live and more words for 
something that is being sent around and has to stand on its own. Does your 
level of words in the given case match how the data will be communicated? 

Exercise 5.10: create visual hierarchy

Affordances are aspects of our visual design that help our audience understand 
how to interact with the data we are communicating. We can draw attention to 
some components and push others to the background to create visual hierarchy 
and make our communications scannable. Want a quick test to see if you've done 
this well? Squint your eyes to see the overall impression of the chart. This changes 
your perception enough to get fresh eyes on a design. The most important elements
should be the first things you see and the most prominent.

For more specific tips on how to achieve visual hierarchy, read through the following
from SWD (paraphrased from Lidwell, Holden, and Butler's Universal Principles
of Design) for highlighting the important stuff and eliminating distractions.
Determine how you can apply these to your next project! 

Highlight the important stuff 


• Bold, italics, and underlining: Use for titles, labels, captions, and short word
sequences to differentiate elements. Bold is generally preferred over italics 
and underlining because it adds minimal noise to the design while clearly 
highlighting chosen elements. Italics add minimal noise, but also don't stand 
out as much and are less legible. Underlining adds noise and compromises 
legibility, so should be used sparingly (if at all). 

• CASE and typeface: Uppercase text in short word sequences is easily scanned, 
which can work well when applied to titles, labels, and keywords. Avoid using 
different fonts as a highlighting technique, as it's difficult to attain a noticeable
difference without disrupting aesthetics.

• Color is an effective highlighting technique when used sparingly and generally
in concert with other highlighting techniques (for example, bold).

• Inversing elements is effective at attracting attention, but can add considerable
noise to a design so should be used sparingly.

• Size is another way to attract attention and signal importance.


Eliminate distractions 

• Not all data are equally important. Use your space and audience's attention 
wisely by getting rid of noncritical data or components. 

• When detail isn't needed, summarize. You should be familiar with all the details,
but that doesn't mean your audience needs to be. Consider whether
summarizing makes sense.

• Ask yourself: would eliminating this change anything? No? Take it out! Resist
the temptation to keep things because you worked hard to create them; if they
don't support the message, they don't serve the purpose of the communication.

• Push necessary, but non-message-impacting items to the background.
Use your knowledge of preattentive attributes to de-emphasize supporting
details. Grey works well for this.


Exercise 5.11: pay attention to detail! 

Many elements add up to create the overall experience our audience feels when 
faced with the visuals we create. Have you ever noticed how some designs feel 
easy and elegant, while others feel clunky and complicated? Paying close attention
to details can help ensure the visuals we create are met with happiness by
our audience. Here are some specific aspects of your visual design to consider to 
achieve this—the next time you create a graph or slide, read through and apply 
the following. 

• Use correct spelling, grammar, punctuation, and math. This should go without
saying, but I encounter examples regularly where there are issues of this
sort. When it comes to misspellings, this is an excellent reason to get a second
set of eyes on your work, soliciting feedback from someone else. Our
brains actually fix errors in our work so that you might not even catch a mistake
you've made! (Unfortunately, that innocent oversight may end up being
the unintended focus of your audience's attention.) A trick I once heard for
spell-checking your own work is to read it backwards: you can't skim when you
do this and so it's easier to identify mistakes. Or you can put it in a really ugly
font, which has a similar effect. Also if you show math, make sure it is correct:
there's no bigger credibility-killer than math that doesn't add up!

• Precisely align elements. As much as possible, aim to create clean vertical and 
horizontal structure across all elements (avoid diagonal, which looks messy, is 
attention grabbing, and slower to read in the case of text). Use table structure 
or turn on gridlines or rulers in your tool to precisely line things up. As I've 
mentioned, I'm a fan of upper-left-most justifying graph titles and axis titles. 
This creates nice framing for the graph (particularly with all cap axis titles, 
which form clean rectangles compared to mixed case). Also, given the typical 
zigzagging "z" of processing, this positioning means your audience hits how 
to read the data before they get to the actual data. Bonus! 

• Use white space strategically. Don't fear white space or fill it just because it's 
there. White space helps make the things that aren't white space stand out. 
Use white space to set things apart. Paired with good alignment, this can help 
you create organized structure in your graph or on your page. 

• Visually tie related things together. When someone looks at the data, make
it clear where to look in accompanying text for related info. When they read
text, make it clear where they should look in the data for evidence of what's
being said. Think back to the Gestalt principles that we covered in Chapter 3
for methods to visually tie elements together; specifically, turn back to Exercise
and Solution 3.2 for an illustration.

• Maintain consistency when it makes sense. When things are different, people 
wonder why. Don't make your audience use their brainpower for this unnecessarily.
If it makes sense to graph things in a similar manner, do so. If you use a
specific color to direct attention in one place, keep this consistent elsewhere 
unless you have a good reason to change it. 

• Observe the overall "feel" of your visual. Step back and consider: how does 
the visual you've created feel to look at? Is it heavy or complicated? How can 
you ease this? If unsure, get feedback from someone else: ask them for adjectives
they would use to describe your work and refine as needed.
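
One way to act on the first tip above (math that adds up) is a quick automated check before you publish. The sketch below is only an illustration: the function names are hypothetical, and it assumes you are showing shares of a total as rounded percentages. It flags the common case where rounded shares no longer sum to 100.

```python
def rounded_shares(values, digits=0):
    """Convert raw values to percent-of-total, rounded for display.

    Hypothetical helper for illustration; `digits` is the number of
    decimal places you plan to show on the graph or slide.
    """
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [round(v / total * 100, digits) for v in values]


def shares_add_up(shares, expected=100, tolerance=1e-9):
    """Return True when the displayed shares sum to the expected total."""
    return abs(sum(shares) - expected) <= tolerance


# Three equal parts round to 33% each, so the displayed labels sum to 99.
displayed = rounded_shares([1, 1, 1])
if not shares_add_up(displayed):
    print("Displayed percentages do not sum to 100; add a footnote or show a decimal.")
```

When the check fails, the data isn't wrong; the rounding is simply visible. A short footnote ("percentages may not sum to 100 due to rounding") or one more decimal place usually resolves it.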


Exercise 5.12: design more accessibly 

The following is adapted from Amy Cesal's guest post on the SWD blog; you can 
read her full article, which includes a number of examples and links to additional 
resources, at storytellingwithdata.com under the title "accessible data viz is better 
data viz." 

Often, when we are creating charts and graphs, we think of ourselves as the ideal 
user. This is not only a problem because we know more about the data than the 
target user but also because other users might have a different set of constraints 
than we do. 

Inclusive design principles and accessibility are important to take into consideration
when designing data visualization because they help a broader audience
understand your graphic. Designing with accessibility in mind can even help make 
your visualizations easier to understand for people without disabilities. 

Being clear with text, distinctive labeling, and adding multiple ways to identify 
the point to your visuals will make it easier for people with impairments and those 
without to interpret your graphs. There are easy ways to add the principles of accessibility
into your visual communications. Here are five simple ones:

1. Add alt text. Alternative text (referred to as alt text) is displayed when the 
image cannot be. Screen readers, the assistive technology used by people 
who are visually impaired, read alt text out loud in place of people seeing 
the image. It's important to have valuable alt text instead of "figure-13.jpg," 
which doesn't help a user understand the content they are missing. Screen
readers speak alt text without allowing users to speed up or skip, so make 
sure the information is descriptive but succinct. Good alt text includes one 
sentence of what the chart is, including the chart type for users with limited 
vision who may only see part of it. It should also include a link to a CSV or 
other machine-readable data format so people with impaired vision can tab 
through the chart data with a screen reader. 

2. Employ a takeaway title. Research suggests that users read the title of the 
graph first. People also tend to just rephrase the title of the graph when 
asked to interpret the meaning of the visualization. When the graph title 
includes the point, the cognitive load of understanding the chart decreases. 
People know what to look for in the data when they read the graph takeaway 
first as part of the title. 

3. Label data directly. Another way to reduce cognitive burden on users is to 
directly label your data rather than using legends. This is especially useful for 
colorblind or visually impaired users who may have difficulty matching colors 
within the plot to those in the legend. It also decreases the work of scanning 
back and forth trying to match the legend with the data. 

4. Check type and color contrast. Colorblindness is an issue for 8% of men and 
0.5% of women with Northern European ancestry. However, we should also 
consider users with low vision and a variety of other conditions that affect 
vision. The Web Content Accessibility Guidelines (www.w3.org) specify necessary
contrast and text sizes for readability on screen. There are a number
of tools to help you abide by these contrast and size standards, for example, 
the Color Palette Accessibility Evaluator. 

5. Use white space. White space is your friend. When information is too densely 
packed, the graphic can feel overwhelming and unreadable. It can be helpful 
to leave a gap between sections of a chart (for example, outlining the sections 
of a stacked bar in white). Judicious use of white space increases the legibility by
helping to demarcate and distinguish between different sections without relying
only on color. This can also supplement accessible color choices by helping
users distinguish the difference between colors that identify separate sections. 
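
If you'd like to check contrast yourself (tip 4) rather than rely on a tool, here is a minimal sketch. It uses the relative luminance and contrast ratio formulas as defined in WCAG 2.x. The 4.5:1 threshold for normal text and 3:1 for large text are the level AA values from that guideline, not numbers stated in the text above, and the function names are my own.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Check a text/background pair against the assumed WCAG level AA thresholds."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold


# Example: is mid-grey footnote text readable on a white slide?
print(passes_aa("#767676", "#FFFFFF"))
```

This is useful for the grey de-emphasis recommended earlier in the chapter: grey pushes text to the background, but it still needs enough contrast to be read.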

These are just a few things you can do to help everyone easily comprehend the 
graphs that you create. You should strive to make sure that everyone (not just
you or your ideal user) understands the point of the visualization. When you consider
accessibility, you create a better product for all.

The next time you need to communicate with data, refer to and apply these tips! 

Exercise 5.13: garner acceptance for your designs

People dislike change. This is a simple fact of human nature. In the scenario where 
we've always shown data in a certain way and people are attached to it—how do 
we convince them to do things differently? What should we do in general when 
met with resistance from our audience? 

This is a change management process. In the same way that we considered our 
audience in the exercises in Chapter 1 and tried to understand what motivates 
them, we can do that here as well: in this situation our audience becomes those 
whose behavior we want to influence. First and foremost, when we want to convince
our audience to be open to our designs, we need to do it in a way that
works for them. 

The wrong way to go about changing their minds sounds something like this, "I 
just read this book, and I learned that we've been doing it wrong; we should really 
be looking at it like this." That might be easy, but it's not so compelling or inspiring.
So unless you're the boss and people have to do what you say (even if that
is the case, you should probably be more subtle in your approach!), you have to 
work to influence your stakeholders or colleagues to change. 

Here are a few strategies from SWD, plus a couple of new ideas, that you can
leverage for gaining acceptance in the design of your data visualization. 

• Articulate the benefits of the new or different design. Sometimes simply giving
people transparency into why things will look different going forward can
help them feel more comfortable. Are there new or improved observations 
you can make by looking at the data in a different way? Or other benefits you 
can articulate to help convince your audience to be open to the change? 

• Show the side-by-side. If a different approach is clearly superior to the way 
things have been done, showing them next to each other will demonstrate this. 
Couple this with the prior suggestion by showing the before-and-after and explaining
why you want to shift the way you are looking at things.

• Provide multiple options and seek input. Rather than prescribing the design, 
create several options and get feedback from colleagues or your audience 
(if appropriate) to determine which design will best meet the given needs. 
Involve stakeholders in the process—they'll be more bought into the solution 
as a result. 

• Get a vocal member of your audience on board. Identify influential members 
of your audience and talk to them one-on-one in an effort to gain acceptance 
of your design. Ask for their feedback and incorporate it. Identify champions:
people outside of your team who support what you want to do and can
help influence others. If you can get one or a couple of vocal members of your
audience or their peers bought in, others may follow. 

• Start with the familiar and transition from there. This can be a particularly 
effective strategy in a live setting. Begin with the view that your audience is 
used to seeing, then pivot to a different one, making it clear how this ties back 
to the original and highlighting what the new visual allows you to see, or how 
it can help frame the conversation in a new way. When a graph is done well, 
you'll often find that you don't have to spend a lot of time talking about the 
graph but rather can spend it discussing what the data shows. This can shift 
the overall conversation in a really helpful way. 

• Don't replace—augment. As an interim step, rather than change anything, 
leave it all as it is. Add to this with your new view(s). For example, rather than 
redesign your regular report, keep it the same. Integrate a couple of slides up 
front or add content to the email that distributes it, applying best practices 
in these places. If done well, this is like saying to your audience, "We haven't 
changed anything—the data is all there and we are happy to go through it with 
you, but we've already taken the time to do that and here (up front, applying 
the lessons covered throughout SWD and this book) are the things you should 
focus on this time." As your audience gains confidence in your ability to hone in 
on the right things in effective ways, you can wean dependence on all the data 
and potentially reduce what you share with your audience over time. 

Reflect on whether any of the above can be employed in your situation to help 
you drive the change that you seek and the acceptance of your visual designs. In 
general, think about how you can set yourself up for success. Getting to know your 
audience—those you want to influence to accept your design—and what drives 
their behaviors can help. Think about not why you think they should change, but 
why they should want to. Make your approach work first and foremost for them. 
Refer back to Chapter 1 for exercises that will help you get to know your audience. 

Also consider whether it's a fight worth fighting. Don't start with big battles. Start 
with low-hanging fruit and achieve small victories. Over time, you'll build credibility
so if and when you do want to make more sweeping changes, you'll have
earned your colleagues' and audience's respect and hopefully have an easier time 
making it happen! 

Exercise 5.14: let's discuss

Consider the following questions related to Chapter 5 lessons and exercises. Discuss
with a partner or group.

1. What role do words play in making our data visualizations comprehendible? 
What kind of text should be present in every graph? Are there any exceptions
to this?

2. When creating visual hierarchy in our designs, it's important both to highlight 
the important stuff and to de-emphasize some aspects. Which elements of 
our graphs and slides are good candidates for de-emphasizing? How can we 
visually push things to the background? 

3. How would you describe thoughtful design when it comes to data visualization? 

4. What does accessibility mean when it comes to communicating with data? 
What steps can we take to make our designs more accessible? 

5. Is it worthwhile to take the time to make our graphs pretty? Why or why not? 

6. How does personal or company brand come into play when communicating 
with data? What are some advantages of this? Are there any disadvantages? 

7. Have you ever wanted to make a change to a graph or the way that you 
visualize data and been met with resistance? What did you do? Were you 
successful? What strategies can we use to influence our audience in general 
when this happens? What will you do the next time you face this situation? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback? 





chapter six 


tell a story 

Data in a spreadsheet or facts on a slide aren't things that naturally stick with
us; they are easily forgotten. Stories, on the other hand, are memorable. Pairing
the potency of story with effective visuals means that our audience can recall what 
they heard or read in addition to what they saw. This is powerful, and we'll explore 
using story to communicate data in concrete ways in this chapter. 

As an aside, my order of lessons sometimes surprises people. Some elements of 
story relate to things that came up when we explored context in Chapter 1—why 
didn't we discuss story then? For me, this is the natural progression. Start with 
context, audience, and message. Time spent there will serve you well even if you 
don't take things full course and employ story. There's value in doing these things 
up front before you spend much time with your data: they can help you target 
your data visualization process and make it more efficient. But then after you've 
spent time with your data, know it well, and have identified what you can use it 
to help others see, it's time to look at the big picture again and figure out how to 
best communicate it to your audience. This is the precise moment story comes 
into play. 

Words, tension, and the narrative arc—these are components of story that we can 
use to get our audience's attention, build credibility, and inspire action. Not only 
are well told data stories memorable, but they can also be retold, empowering 
our audience to help spread our message. In this chapter, we'll undertake exercises
that help highlight the importance of not just showing data, but making data a
pivotal point in an overarching story. 

Let's practice telling a story! 

First, we'll review the main lessons from SWD Chapter 6.

Exercise 6.1: use takeaway titles

As we illustrated through exercises in Chapter 5 (5.1 and 5.9), text plays an important
role when we communicate with data because words help make data
understandable to our audience. Slide titles represent one important—and often 
underutilized—place to use words well. 

Picture a slide. Typically, there is a title at the top. This title space is precious 
real estate. It is the first thing our audience encounters when they see the page: 
whether projected on a big screen or their computer monitor or printed on a
piece of paper. Too often, we use this precious real estate for descriptive titles. 
Instead, I'm a fan of using action titles. If there is a key takeaway—which there 
should be—put it there, so your audience doesn't miss it! 

Studies have shown that effective titles can help improve both the memorability 
and recall of what is shown in a graph. Titling with the key takeaway also creates 
the right expectation for our audience: when we've done it well, it sets up what is 
to follow on the rest of the page. 

Let's practice forming takeaway titles and understanding how changing titles can 
direct our audience to focus on different aspects of our data. See Figure 6.1, 
which shows Net Promoter Score (NPS) for our business and our top competitors. 
NPS is a common metric used in voice of customer analytics. The higher the number,
the better.

FIGURE 6.1 What's the story?

STEP 1: Create a takeaway title to answer the question posed at the top: "What's 
the story?" Write it down. What does the title encourage your audience to focus 
on in the graph? Write a sentence or two. 

STEP 2: Create a different takeaway title for this slide and repeat the other actions
from Step 1.

STEP 3: Consider whether the takeaway titles you've created provide any sentiment
for your audience: do they tell your audience how to feel about this data? If
so, how? If not, how might you retitle to convey a positive or negative message? 

Solution 6.1: use takeaway titles

What's the story? This is a question we sometimes ask when we don't mean story 
at all. Rather, we mean What's the point?, What's the takeaway?, or So what? For 
me, this is the minimum level of "story" that should exist any time we show data for 
explanatory purposes. We can use our title space to make our primary point clear. 

STEP 1: I could title this slide, "NPS is increasing over time." If I were to do so, my 
audience would simply read those words and then be primed to be looking for a 
line increasing upwards to the right. Upon seeing the graph, with attention drawn to 
Our Business, the words read in the title would be confirmed in the picture. 

STEP 2: As an alternative title, I might go with "NPS: we rank 4th among competitors."
This prompts my reader to turn to the graph and start counting down
the right-hand side... 1, 2, 3, yep, 4th indeed. The words set a notion for what is
to come in the graph and the graph reinforces the words in the title. 

STEP 3: I can also use this title space to set an expectation with my audience:
Is this a good thing? Is it a bad thing? My previous suggested titles did not do 
this. But imagine that I title the slide, "Great work! NPS is increasing over time." 
Doing so would cause you to feel very differently about the data than if I were to 
title it "More work ahead: we still haven't hit top 3." The words we put around 
our data visualizations are critically important. Use this power carefully! 

As a related aside, I'm often asked about my choice of case (capital and lower-case
letters). For slide titles, I am in the habit of using sentence case (where
the first word is capitalized and the rest is lowercase). I do this because I think 
sentence case lends itself more easily to action or takeaway titles (rather than 
title case, where every word is capitalized and is more likely to end up being a 
descriptive title, e.g. "NPS Over Time"). Be thoughtful and consistent in your use 
of letter case. 

And beyond all else, as we've seen before and will continue to explore—use 
words wisely! Employing takeaway titles is one way to use your words well. 

Exercise 6.2: put it into words

After creating a graph, I find it can be useful to come up with a sentence that 
describes the graph. This practice forces me to articulate a takeaway (or in some 
instances, a number of potential takeaways), which can sometimes even lead to 
different ways to show the data to better highlight the main point I want to make. 

Let's practice doing this with a specific graph. Imagine you work at a bank and 
you are analyzing collections data. Collections departments often use dialers, machines
that automatically place calls. Many calls go unanswered. When someone
does answer, the collections agent is connected so they can talk to the individual
to work out a payment plan and the account has been "worked." Numerous metrics
are tracked related to this. We'll look at penetration rate, which is the proportion
of accounts worked relative to the total number of accounts dialed.
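
To make the definition concrete, here is a minimal sketch of the calculation. The function name and the example numbers are hypothetical (they are not taken from the exercise data); the only thing it assumes from the text is that penetration rate is accounts worked divided by accounts dialed.

```python
def penetration_rate(accounts_worked: int, accounts_dialed: int) -> float:
    """Proportion of dialed accounts that were actually worked."""
    if accounts_dialed <= 0:
        raise ValueError("accounts_dialed must be positive")
    if accounts_worked < 0 or accounts_worked > accounts_dialed:
        raise ValueError("accounts_worked must be between 0 and accounts_dialed")
    return accounts_worked / accounts_dialed


# Hypothetical month: 50 accounts worked out of 200 dialed.
rate = penetration_rate(50, 200)
print(f"Penetration rate: {rate:.0%}")
```

Because it is a ratio, penetration rate can fall even when both of its inputs fall, as long as accounts worked drops faster than accounts dialed. Keep that in mind when you write your observations in Step 1.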

Consider Figure 6.2a, which shows Accounts Worked, Dials Made, and the Penetration
Rate.

FIGURE 6.2a Put it into words


STEP 1: Write three distinct sentences articulating three different observations 
from this data. You may think of these as three potential takeaways that you could 
highlight in this data. 

STEP 2: Which of the three sentences you've written would you focus on if you
were communicating this data? Why? Are there any aspects of the others you'd 
also want to include? How can you achieve this? 

STEP 3: Are there any changes you'd make to the visual to better focus the audience
on the takeaway you've chosen to highlight? Outline those changes.

STEP 4: Download the data and make the changes you've outlined in the tool of 
your choice. 




Solution 6.2: put it into words 

Describing my graph in words forces me to really look at the data and think about 
what is important and which aspects I may want to point out to my audience. 

STEP 1: When I look at the data, I see things generally decreasing over the 
course of the year. But we can get more specific than that, which is one benefit 
from writing multiple sentences about a single graph (rather than just the first one 
that comes to mind). There are three data series depicted, so I'll write one observation
about each:

1. The number of accounts worked varies over time and has generally decreased
over the course of the year.

2. Dials made decreased 47% between January and December, with roughly 
250,000 dials made in December. 

3. Penetration rate has decreased markedly over time. 
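
A statement like the second one is easy to sanity-check before it goes on a slide. The sketch below uses hypothetical helper names. The January figure it produces is only what the two numbers in the sentence imply (a 47% decrease ending at roughly 250,000); it is not a value read from the data.

```python
def percent_change(start: float, end: float) -> float:
    """Percent change from start to end (negative means a decrease)."""
    if start == 0:
        raise ValueError("start must be nonzero")
    return (end - start) / start * 100


def implied_start(end: float, pct_change: float) -> float:
    """Back out the starting value implied by an ending value and a percent change."""
    return end / (1 + pct_change / 100)


december_dials = 250_000                               # "roughly 250,000" from the sentence
january_dials = implied_start(december_dials, -47)     # implied, not observed
print(f"Implied January dials: {january_dials:,.0f}")
print(f"Check: {percent_change(january_dials, december_dials):.0f}% change")
```

Running the numbers both ways like this is a cheap guard against the credibility-killing math errors discussed in Exercise 5.11.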

STEP 2: I'm inclined to want to focus on the decrease in Penetration Rate, since
this reflects pieces of both of the other data series. That said, I wouldn't want to let
go of all the other content, because this lends important context. It's interesting, for
example, that Penetration Rate has decreased in spite of decreasing Dials Made.
One might think that as fewer dials are made, the relative number of accounts
worked would go up, but that clearly isn't what is happening. Maybe the easy accounts
(those reachable or more likely to pay) have all been worked, so now there
are fewer accounts to dial and work, but they are the more difficult ones? I'm guessing,
but this would be the sort of context I'd be eager to learn more about in order
to better understand what's driving what we are seeing in the data.

Coming back to the question of how I'd incorporate aspects of the data (and
playing off this exercise title), I could plan to put some of it into words. For
example, my second sentence outlined above, "Dials made decreased 47% between
January and December, with roughly 250,000 dials made in December," could
be context I incorporate by simply saying or writing it. Doing so opens up some
additional potential ways to show the data, which we'll look at momentarily.
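
The arithmetic behind that sentence can be checked with a short sketch. The 47% decrease and the roughly 250,000 December dials come from the text; the January figure is not stated anywhere and is only implied by those two numbers, so treat it as an estimate rather than a value from the data.

```python
def implied_start(end_value: float, pct_decrease: float) -> float:
    """Back out the starting value from an end value and a percent decrease."""
    return end_value / (1 - pct_decrease / 100)


def pct_change(start: float, end: float) -> float:
    """Percent change from start to end (negative means a decrease)."""
    return (end - start) / start * 100


december_dials = 250_000  # from the text (approximate)
decrease_pct = 47         # from the text

# Assumption: January is derived here, not read from the data.
january_dials = implied_start(december_dials, decrease_pct)

print(f"Implied January dials: about {january_dials:,.0f}")
print(f"Change Jan to Dec: {pct_change(january_dials, december_dials):.0f}%")
```

Writing the takeaway as a sentence and then confirming the numbers this way is a cheap guard against stating a percentage that doesn't match the graph.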

STEP 3: Yes, there are changes I would make to how this data is shown. I like 
the general clean design of the graph. But currently both the legend at the top 
and secondary y-axis at the right mean my audience has to do work—some back 
and forth—to figure out how to read this data. I'd like to make this easier. Also, 
as mentioned, I think there is an opportunity to articulate some of the context in 
words so that we can focus on the Penetration Rate in the graph. 

STEP 4: Let's progress through a few views of this data so I can show you my
thought process. First, I'll get rid of the secondary y-axis and the Penetration Rate
data series that went with it (we'll reincorporate the latter momentarily). Notice
also that the Accounts Worked in the original graph is a proportion of Dials Made,
so if I change the data a bit, I can show these together. Rather than Dials Made
and Accounts Worked, we can show those Worked and those Not Reached. See
Figure 6.2b.

FIGURE 6.2b Modify the data so we can stack it

In Figure 6.2b, the overall height of the bars (Worked plus those Not Reached) 
represents the total number of accounts that are dialed. We've already said we'll 
articulate the decrease in accounts dialed in words, which means we don't
necessarily have to show it directly. In that case, one option could be to turn these into
100% stacked bars. We'll lose sight of the decrease in accounts dialed, but gain
a clearer picture of the ratio of accounts Worked versus those Not Reached (the
Penetration Rate). See Figure 6.2c.

FIGURE 6.2c Change to stacked 100% bars

The benefit to moving from absolute bars to 100% bars is to be able to more
easily see the proportion of accounts dialed that are worked. We can take this a step
further, eliminating the space between the bars and modifying to an area graph.
See Figure 6.2d.

FIGURE 6.2d Let's change to stacked area

I don't often use area graphs, but there are situations where they can work well
and I think this is one of those. One sometimes-con of area graphs is that it isn't 
always clear whether the individual series are stacked on top of each other or 
meant to be read from the x-axis up in cumulative fashion. Here, given the 100% 
stack, this is likely intuitive. 

We get a couple of benefits from this view. We can clearly focus on the proportion 
of accounts that are Worked, given the emphasis via color. In this picture, the line 
that separates the green from the grey now represents the Penetration Rate. 
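
The data reshaping described in this step can be sketched in a few lines. The monthly values below are hypothetical and chosen only for illustration (they are not the exercise's data), and the function name `to_stacked_shares` is my own. The sketch splits Dials Made into Worked and Not Reached, then converts both to shares of 100%, so that the Worked share is the Penetration Rate.

```python
# Hypothetical values for illustration only; not the data from the exercise.
dials_made = {"Jan": 1000, "Feb": 900, "Mar": 800}
accounts_worked = {"Jan": 300, "Feb": 225, "Mar": 160}


def to_stacked_shares(dials, worked):
    """Split each month's dials into Worked and Not Reached, plus 100% shares."""
    rows = []
    for month, total in dials.items():
        w = worked[month]
        not_reached = total - w  # the two pieces stack to the total dialed
        rows.append({
            "month": month,
            "worked": w,
            "not_reached": not_reached,
            "worked_pct": 100 * w / total,  # this share is the Penetration Rate
            "not_reached_pct": 100 * not_reached / total,
        })
    return rows


for row in to_stacked_shares(dials_made, accounts_worked):
    print(f"{row['month']}: worked {row['worked_pct']:.0f}%, "
          f"not reached {row['not_reached_pct']:.0f}%")
```

The absolute columns feed the stacked bars of Figure 6.2b, while the percentage columns feed the 100% views. Keeping both in one table makes it easy to switch between them if the audience prefers absolute numbers.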

One of the things I was originally hoping to solve for was any back and forth
between the legend and the data. As we've seen in other examples, one method I'll
often employ is to place the legend at the upper left (often just under the graph title, 
as I've done in this case). Given the zigzagging "z" of processing, this helps ensure 
my audience sees how to read the data before they get to the actual data. Another 
alternative is to label the data directly. I tried that, placing the Not Reached and 
Worked labels in the area graph as white text at the left and another option with 
them aligned to the right. These both looked messy to me, so I kept the legend 
separate at the top and chose to label just the Penetration Rate directly. 

Let's do that plus put a few additional words around the data and highlight the 
final data point. Without additional context to know what's driving what we're 
seeing, Figure 6.2e is where I'd end on this one. 


Total accounts dialed decreased 47% from January to December to 250K. 
During the same time period, penetration rate has declined markedly.

FIGURE 6.2e Put it into words!

I should probably mention that this particular makeover has been met with mixed
responses. Some found the 100% aspect confusing or preferred the simple 
stacked bar view that also gave insight into absolute numbers. It's possible I've 
become overly attached to the solution I created. In spite of this feedback, I've 
chosen to include it here because it highlights a different approach than we
typically take, reinforcing the idea that it is okay to try things that are outside of the
norm. If I were planning to use this in a business setting, I'd get additional
feedback to determine whether to push forward with this solution or modify it to best
meet my audience's needs. 

The primary point is: putting our graphs into words can help us get clear on what 
we want to show and effective ways to show it. It can be helpful to put those 
words directly with the graph to help ensure it makes sense to our audience!

Exercise 6.3: identify the tension

Let's step away from data and graphs for the next few exercises and dive deeper 
into elements of story. 

Tension is a critical (and often overlooked) component when we communicate
with data. When I teach the lesson on story in my workshops, I often get quite 
dramatic in my delivery, particularly in the discussion about tension. This is to 
emphasize the point, but shouldn't be taken as if we have to create drama for our 
stories to be effective. It's not about making up tension—if there weren't tension 
present, we'd have nothing to communicate about in the first place. Rather, it's 
about figuring out what tension exists and how we can illuminate it for our
audience. When we do this well, we get their attention and are in a better position to
motivate them to act. 

Thinking back to some of the lessons we practiced in Chapter 1, getting to know 
our audience and what matters to them cannot be emphasized enough. It's easy 
to focus on what matters to us, but that's not a good way to influence. Rather, 
we need to step outside of ourselves and think about what tension exists for our 
audience. This relates back to one of the components of the Big Idea that we 
discussed: what is at stake? When we've effectively identified the tension in a 
situation, then the action we want our audience to take becomes how they can 
resolve the tension in the data story (we'll talk about this idea further in a number 
of the forthcoming exercises in this chapter). 

Let's look at a few different scenarios (some may sound familiar from earlier
discussions and some are new) and practice identifying the tension. Consider
each of the following. First, identify the tension. Next, identify an action the 
given audience can take to resolve the tension you've identified. 

SCENARIO 1: You are an analyst working at a national retailer. You've just
conducted a survey from the recent back-to-school shopping season, asking both
your store's customers and your main competitors' customers about various
dimensions of the shopping experience. On the positive side, you've found the
data confirms some things you thought to be true: people enjoy the overall
experience of shopping in your store and they have positive brand association. When it
comes to opportunities, you've found there are inconsistencies in the service
levels that customers are reporting across your stores. Your team has brainstormed
solutions to this and wants to put forth a specific recommendation to the Head
of Retail: sales associate training should be developed and rolled out to create
shared understanding of what good service looks like to provide consistent
exemplary customer service.

SCENARIO 2: You run HR at a company that has historically intentionally filled 
leadership at the director level through internal promotions (not hiring externally).
Attrition (people leaving the company) at the director level has increased
recently. In light of this, you asked your team to build a forecast projecting the
next five years based on recent trends for promotions, acquisitions, and attrition.
You believe, based on the expected continued growth of the company, that
unless something changes, you will face a gap in future leadership talent needed
compared to what you'll have. You'd like to use this data to drive a conversation 
among the executive team about what to do. The options, as you see them, are 
to better understand what's driving attrition at the director level and work to curb 
it, invest in manager development so you can promote at a faster rate, make 
strategic acquisitions to bring leadership talent into the organization, or change 
your hiring strategy and start to fill director-level positions through external hires. 

SCENARIO 3: You work as a data analyst at a regional health care center. As 
part of ongoing initiatives to improve overall efficiency, cost, and quality of care, 
there has been a push in recent years for greater use of virtual communications 
by physicians (via email, phone, and video) when possible in place of in-person 
visits. You've been asked to pull together data for inclusion in the annual review 
to assess whether the desired shift towards virtual is happening and make
recommendations for targets for the coming year. The main audience is leadership
across the health care centers. Your analysis indicates there has indeed been a
relative increase in virtual encounters across both primary and specialty care. You've
forecast the coming year and expect these trends to continue. You can use recent
data and your forecast to inform targets. You believe seeking physician input is
also necessary to avoid being over-aggressive and setting targets that may
inadvertently lead to negative impact on quality of care.

Solution 6.3: identify the tension

There isn't a single right answer, but the following outlines how I would frame the
tension and resolution in each of these cases.

SCENARIO 1: 

• Tension: There is inconsistency in service levels across stores. 

• Resolution: Devote resources to developing and conducting sales associate 
training. 

SCENARIO 2: 

• Tension: Looking forward, we expect a shortage of directors given recent trends. 

• Resolution: Discuss and make a decision about what strategic change(s) we 
should make to fill roles at the leadership level. 

SCENARIO 3: 

• Tension: What's more important: efficiency or quality of care? The desired shift
towards virtual encounters is happening, but how much more do we want to push?

• Resolution: Use data together with physician input to set reasonable targets 
for the coming year to appropriately balance efficiency with quality of care.

Exercise 6.4: utilize the components of story

The way I approach story has changed perhaps the most since writing SWD. In
SWD, story was examined through plays, books, and movies. The general
structure of story I put forth was that a story consists of a beginning, middle, and
end. While this is useful, I believe we can take things a step further by considering 
the narrative arc. 



Stories have a shape. They start off with a plot. Tension is introduced. This tension 
builds in the form of a rising action. It reaches a point of climax. There is a falling 
action. The story concludes with a resolution. We are hardwired to engage with 
and remember information that comes to us in this general structure. 

The challenge is that the typical business presentation doesn't look anything like 
this! The typical business presentation follows a linear path. There is no up or down; 
we move straight through it. We start with the question we set out to answer, then 
discuss the data, followed by our analysis, and finally our findings or
recommendation. This linear path, by the way, is what our storyboards mainly looked like back
in Chapter 1. A big advantage can be gained by rethinking the components of our
storyboards along the narrative arc. Figure 6.4a shows the narrative arc.

FIGURE 6.4a The narrative arc

Let's revisit a storyboard we've already completed. Refer back to Exercise 1.7, 
where we storyboarded about back-to-school shopping. Look to either your
storyboard or mine (Solution 1.7). How could you arrange these components along
the narrative arc? Does this mean you reorder, add, or eliminate pieces? 

One way to accomplish this is to get a pile of sticky notes and write out the
components of the storyboard from Solution 1.7. Then arrange them along the arc,
augmenting with additional ideas and removing and rearranging as makes sense.

Solution 6.4: utilize the components of story

Figure 6.4b shows how I might arrange components of the back-to-school
shopping scenario along the narrative arc.

FIGURE 6.4b Back-to-school shopping narrative arc

I could start by setting the plot. This is the basic information—the framing—that 
my audience needs to know to have consistent context to use as a jumping off 
point: "The back-to-school shopping season is a critical part of our business and 
we haven't historically been data-driven in how we've approached it." 

Next, I introduce tension and start to build the rising action. "We've conducted a 
survey, and, for the first time ever, we have data. The data shows that we are
doing well in some areas, but underperform in several key areas!" This is the climax,
where tension peaks. I can talk specifically about the areas that underperform and
what is at stake for my audience as a result of this: we are losing to the
competition and will continue to do so unless we make a change.

I can then soften things with the falling action. "Not all areas are equally
important, and we've identified a small handful on which to focus. Plus, we've already
looked into several ideas for solving the issue and have narrowed in on one we
think will have a big impact." Ending: "Let's invest in employee training to
improve the in-store customer experience and make the upcoming back-to-school
shopping season the best one yet." Here is what you, the audience, can do to
resolve the tension I've brought to light.

Exercise 6.5: arrange along the narrative arc

Let's look at another example of arranging the components of a story along the 
narrative arc. We'll revisit a situation we looked at a couple of times in Chapter 1.

Refer back to Exercise 1.8. Reread the pet adoption scenario. Did you complete 
this exercise and create a storyboard? If not, you can take some time to do so 
now, or review my example storyboard from Solution 1.8. How can you rethink the 
components making use of the narrative arc? 

For reference, a blank narrative arc follows (it will look familiar if you've already 
completed the preceding exercise; if not, you may want to read through it for 
additional general context). One way to complete this exercise is to write the 
components of your story on small sticky notes and arrange them over or under 
Figure 6.5a. Your stickies need not match those in the original storyboard: feel 
free to depart from this as you make use of this shape and its components. Be 
creative in your approach!

FIGURE 6.5a The narrative arc

Solution 6.5: arrange along the narrative arc

This scenario seems like it may call for less rigidity than a typical business
presentation. On the other hand, lives are actually at stake (the animals who are
potentially being adopted), so considering our audience and how best to convince
them to get what we need is definitely important. Thinking back to exercises we
tackled in Chapter 1: what motivates our audience? Is it meeting our adoption
goal, or might we go beyond that? Different context and assumptions here will
change how you approach this.

The narrative arc I created is shown in Figure 6.5b.

FIGURE 6.5b Pet adoptions narrative arc

I took a bit of a risk in this example, starting by painting a picture (the plot) of a 
beautiful day at the park, where we are running a normal adoption event. Tension 
is introduced when we only have a few successful adoptions. This tension builds 
in the form of a rising action as I run my audience through the normal course of 
events for animals that end up in our shelter. This tension reaches a point of climax 
when innocent animals face euthanization. I can soften that climax (falling action) 
by describing the recent event that was unexpectedly moved to a local pet
retailer due to inclement weather and the success of the event. I can summarize the
limited resources we'd need to do it again. My audience can resolve the tension 
(ending) by approving resources for the pilot program. 

I should point out that these are not the only components we should consider and 
this is definitely not the only possible order. Rather, this is one example of how we 
can make use of the narrative arc, given what we know and the assumptions that
I've made. As one example of a riff on this (particularly if I'm not confident I can
keep my audience's attention until the end, or if I think they may simply approve
my request and I don't have to take the time to go through the details), I could
opt for leading with the ending. I could start by saying, "I need $500 and 3 hours
of a volunteer's time to launch a pilot program that I believe will increase
adoptions: do you want to hear more?" (Note that this looks quite a lot like the Big Idea
that we formed in Exercise 1.5!) 

There is no end to how we can rearrange or add or take away and there are
numerous different approaches that would lead to a successful communication.
Paramount is that you give thought to how you do this to set yourself up for success.

Exercise 6.6: differentiate between live & stand-alone stories

There are two common scenarios when we communicate for explanatory purposes
with data: (1) we are presenting live to our audience (in a meeting or
presentation, whether in person or virtually via Webex or similar) and (2) we send
something to our audience (typically through email, though we may still encounter the
occasional print-it-out-and-leave-it-on-someone's-desk situation).

In practice, we often create a single communication that is meant to meet the 
needs of both of these instances. We touched on this briefly in SWD: this gives 
rise to the "slideument." It is part presentation, part document and doesn't
exactly meet the needs of either circumstance. Typically, our content created to meet
both of these needs is too dense for the presented version and sometimes not 
detailed enough for the version that is consumed on its own without you present 
to support it. 

There is an approach that I often recommend: build piece by piece for the live 
presentation, then end with a fully annotated slide. Let's do an exercise to practice 
and illustrate this concept. 

Imagine you are a consultant to Company X. You've been brought in to analyze
the hiring process. Your goals are both to bring greater understanding to how
things have been functioning generally (given that no one has spent much time
looking at this data before) and to use this to facilitate discussions with the
steering committee at Company X, which has chartered the work to identify specific
improvements. You've met with the steering committee several times already and
developed a good understanding of the business context. Time to hire (the
number of days it takes to fill a job opening once it is posted) is one metric in which
people are highly interested and will be the focus of this exercise.

Figure 6.6a shows time to fill open roles (measured in days) for internal transfers 
and external hires for Company X. Spend a moment familiarizing yourself with this 
data, then complete the following steps.

FIGURE 6.6a Time to fill

STEP 1: Let's say you will have an upcoming meeting with the steering
committee. You have 10 minutes on the agenda to discuss time to hire. You'd like to take
a couple of minutes to set the context by walking your audience through the data 
in Figure 6.6a and use that to facilitate a conversation. Take advantage of the fact 
that you'll be live in person with your audience: rather than simply show Figure 
6.6a, consider how you might build the graph one or a few elements at a time. 
Create a bulleted list of what you would show, step by step. Feel free to liberally 
make assumptions. 

STEP 2: Download the data and create the progression you outlined in Step 1 in 
the tool of your choice. 

STEP 3: You anticipate the steering committee will want your visuals after the 
meeting. Rather than share the progression you went through, you've decided to 
build a single comprehensive graph (or slide). This will serve as a reminder of what 
you shared and will also be a good resource for anyone who misses the meeting. 
Create a visual to meet this need in the tool of your choice.

Solution 6.6: differentiate between live & stand-alone stories

STEP 1: My progression building this graph piece by piece could look like the 
following. 

• Start with a skeleton graph that has x- and y-axis title and labels but no data. 
Use this to set the stage for my audience. 

• Add the Goal line, sharing any known context about how it was set. 

• Build the External line. Start with the first point in January, then add data 
through June and plan to talk through known context causing this trend. Then 
build the rest of the line, highlighting specific data I want to draw attention 
to as I do. 

• Build the Internal line. Push the External line back so it doesn't compete for 
attention, then build the Internal line in a similar fashion, highlighting and 
planning to raise points of interest along the way. 
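
The progression above can be written down as an ordered list of frames before opening any charting tool. This is only a planning sketch: the `Frame` fields and the function names are my own, and no data values are involved, just which elements are visible at each step.

```python
from dataclasses import dataclass

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


@dataclass(frozen=True)
class Frame:
    """One step of the live build: what is visible on the graph."""
    title: str
    show_goal: bool
    external_points: int   # how many months of the External line are drawn
    internal_points: int   # how many months of the Internal line are drawn
    external_muted: bool   # True once External is pushed to the background


def build_sequence():
    """Frames in presentation order; each one only adds to the previous."""
    return [
        Frame("Skeleton graph: axes and labels only", False, 0, 0, False),
        Frame("Add the Goal line", True, 0, 0, False),
        Frame("External: first point (January)", True, 1, 0, False),
        Frame("External: through June", True, 6, 0, False),
        Frame("External: rest of the year", True, 12, 0, False),
        Frame("Internal: first point, External pushed back", True, 12, 1, True),
        Frame("Internal: rest of the year", True, 12, 12, True),
    ]


def visible_months(n_points):
    """Months drawn when the first n_points of a line are shown."""
    return MONTHS[:n_points]


for step, frame in enumerate(build_sequence(), start=1):
    print(f"{step}. {frame.title} "
          f"(external: {len(visible_months(frame.external_points))} months, "
          f"internal: {len(visible_months(frame.internal_points))} months)")
```

Each frame maps to one slide or one animation step. Finer-grained builds, such as revealing the Internal line a few months at a time, are just additional frames inserted into the same list.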


STEP 2: The following shows how I would execute the steps I've outlined, along 
with my planned accompanying commentary. I've liberally made assumptions
regarding the context for the purpose of illustration.

Let me take a few minutes to share with you recent data concerning time to hire. 
I'm going to use this to frame a conversation about some potential decisions you 
could make to impact time to hire going forward.

First, let me set up for you what we'll be looking at before distracting you with any
data. On the vertical y-axis, I'll be plotting time to fill. This is the average number 
of days from an open job being posted to a successful hire for hires made in the 
given month. On the x-axis, I'll plot time. We're looking at data from 2019, starting 
with January on the left and going through December on the right. (Figure 6.6b)

FIGURE 6.6b Skeleton graph

The company-wide goal is to fill roles within 60 days. (Figure 6.6c)

FIGURE 6.6c Introduce goal

Let's look at external hires first. Average time to hire in January was just under 45
days, well under our 60-day goal. (Figure 6.6d)

FIGURE 6.6d First point of external line

However, this increased steadily over the first half of the year. This coincides with 
increasing average number of interviews per candidate. As one may expect, the 
more interviews there are, the longer the hiring process. This led us to be just 
above goal in June, with average time to fill role of 67 days. (Figure 6.6e)

FIGURE 6.6e External time to fill increased in first half of year

Monthly time to fill for external hires varied quite a lot over the second half of the
year. We found that those months having lower time to hire—denoted by blue 
markers—had fewer interviews per candidate on average. Both a greater quantity 
of interviews and interviewer vacation schedules likely contributed to the months 
above goal, designated by the orange markers. (Figure 6.6f)

FIGURE 6.6f External time to fill varied in second half of year

Let's shift next to look at internal time to fill—these are those roles being filled by 
internal transfers. We started the year beating goal at 48 days to hire. (Figure 6.6g)

FIGURE 6.6g Add first point of internal line

Time to fill with internal candidates improved, decreasing the first few months
of the year. In March and April, time to fill was under 3 weeks for internal
candidates: this is impressively fast! (Figure 6.6h)

FIGURE 6.6h Internal time to fill low first few months of year


Time to hire increased in May. This coincided with an increase in the number of 
internal transfers, indicating that our processes might not be able to efficiently 
handle greater numbers of transfers. (Figure 6.6i)

FIGURE 6.6i Increased April to May

After May, there was a slight dip followed by another increase. (Figure 6.6j)

FIGURE 6.6j Another dip and increase


September to November saw another dip then increase. (Figure 6.6k)

FIGURE 6.6k Yet another dip and increase

Though down from November to December, internal time to fill was higher than
external in December. Though a bit noisy month-to-month, there was a general 
increase in time to fill internal hires in the second half of the year. (Figure 6.6l)

FIGURE 6.6l Internal ends the year above external

Let's look at the full picture and summarize. Both external and internal time to fill
roles have varied over the past year. While both beat the 60-day target in most of 
the year, we have seen time to fill generally increase over the latter part of 2019. 
Probably not unexpectedly, more interviews lead to longer time to fill roles.
Vacation schedules also contribute to delays. On the internal side, things take longer
when we have more internal candidates, suggesting there could be some process 
improvements for better handling larger quantities. 

Let's discuss: what does this mean for the coming year? Are there any changes 
you'd like to make? (Figure 6.6m)

FIGURE 6.6m Let's discuss the implications looking forward

STEP 3: I might summarize the progression illustrated in Step 2 with the following
fully annotated visual. See Figure 6.6n. 


Time to fill role discussion needed: where do we go from here? 

Both External and Internal time to fill have varied in the past year. Understanding contributing
factors (number of interviews, vacation schedules, and current internal transfer volume
constraints) can help us better plan for the future.

[...] rising above goal in June at more than
60 days. This was largely the result of
[...] candidate.

External time to fill varied markedly in the second half of
the year, above goal in Sep & Nov. Months with lower time to
fill had fewer interviews per candidate, while longer
time to fill months had more interviews. Interviewer [...]

Internal time to fill consistently beat goal, with general increase
in recent months. Months having lower internal time to fill coincide
with those having fewer internal candidates placed. Time delays are
experienced when there are more internal applicants. Further
research is needed to better understand and remedy.

LET'S DISCUSS: Should we put stricter guidelines around maximum number of interviews?
How can we keep vacation schedules from impacting time to hire? What can we do to improve 
efficiency of internal transfer process in order to better handle higher volumes? 


FIGURE 6.6n Fully annotated view to be distributed 

With Figure 6.6n, the audience processing it on their own—those who missed 
the meeting or need a reminder of what was covered—can read through a similar 
story to what I would walk through in a live setting. 


Consider how this approach of building piece by piece in a live meeting or
presentation, coupled with a fully annotated slide or two, could meet your needs for
effectively telling your data stories.

Exercise 6.7: transition from dashboard to story

In Chapter 1 of SWD, I drew a distinction between exploratory and explanatory
analysis. In a nutshell, exploratory is what you do to understand the data and explanatory
is what you do to communicate something about the data to someone else. 

I consider dashboards to be a useful tool in the exploratory part of the process. 
There is some data that we need to be looking at on a regular basis (weekly, 
monthly, or quarterly) to see where things are in line with our expectations as 
well as where they are not. The dashboard can help us identify where there might 
be something unexpected or interesting happening. However, once we've found 
those interesting things and want to communicate them, we should take that data 
out of the dashboard and apply the various lessons that we've covered. 

Let's look at an example dashboard and practice how we can move from
exploratory dashboard to explanatory story. Refer to Figure 6.7a, which shows the
Project Dashboard. You'll see demand and capacity breakdowns by a variety of 
categories (by region and department). The metric being graphed across the 
dashboard is project hours. 

This may feel familiar, as we've looked at some of this data already in Exercises 2.3 
and 2.4. Spend some time studying Figure 6.7a, then complete the following steps. 

[Figure 6.7a content: PROJECT DASHBOARD. Period: April 1, 2019 to December 31, 2019. Metric: Project Hours.
Panel "By Role": Developer, Business Analyst, Support Analyst, Project Manager (capacity vs. demand).
Panel "Demand by Sponsor Department": (Empty), Supply Chain Logistics, Growth & Innovation, US Retail Sales, Information Services, Internal ET, Human Resources.]


STEP 1: Let's start by practicing putting it into words. Write a sentence describing 
a takeaway from each component of the dashboard in Figure 6.7a. 

STEP 2: Do we need all of this data? It may be important to look at project hours 
cut by each of these dimensions as part of our exploration of the data, but not all 
of the data is equally interesting when it comes to communicating it to our
audience. Imagine you need to tell a story with this data: which parts of the dashboard
would you focus on and which would you omit? 

STEP 3: Create a visual story with the elements you selected to include in Step 2. 
Make assumptions as needed for purposes of the exercise. How would you show 
the data? How will you incorporate words? Decide whether you'll present live or 
send the information off to be consumed on its own. Optimize your approach for 
the scenario you've chosen. 

FIGURE 6.7a Project dashboard

Solution 6.7: transition from dashboard to story

STEP 1: I might summarize the main takeaway from the various components of 
the dashboard as follows. 

• Summary stats at top: Demand far exceeds capacity in the period 4/1/19 to 
12/31/19. 

• By Region: Demand exceeds capacity across all regions in roughly similar 
magnitudes. 

• By Month: The gap between capacity and demand, which was largest in June
and generally quite high through Q2 and Q3, narrowed over the latter part
of the year.

• By Role: Demand exceeds capacity the most for Developers; the gap
between capacity and demand is also high for Business Analysts.

• By Sponsor Department: We are missing a lot of data related to the source of 
demand—or perhaps not all projects have a sponsor department? 

STEP 2: I'll start by eliminating the things I don't want to include. Where things 
are pretty consistent or where we are missing data might be obvious aspects to 
omit (unless these vary from our expectations, in which case they may be
interesting). There isn't a single correct answer: we are missing a lot of context, and have
to make a number of assumptions. In real life, you'd want to build this context for 
yourself to make smart choices about where to focus your efforts, what is relevant, 
and which data to include or omit as you build your data story. 

I do see some interesting things happening over time and by role, so I'll focus my
attention there. In terms of changes to how the data is shown, I'll want to focus on
the decreasing gap over time and also more clearly illustrate the difference
between capacity and demand by role. I'll put more words around the data, both to
make it clear what we are looking at, as well as to help walk my audience through 
the story. 

STEP 3: I'll assume this is part of a year-end update and that the information is 
being sent off for my audience to consume on their own. Figure 6.7b illustrates 
what a single slide with the information I've chosen to concentrate on could look 
like (with my imagined context for illustration purposes). 

Expect continued progress towards meeting demand

OVER TIME: GAP DECREASING, BUT PERSISTS

Demand continues to exceed capacity as of year end.
In 2019, we reduced the gap between demand and
capacity markedly, mainly by clearing backlog of
requested projects that aren't possible given current
team structure and competing priorities.

BY ROLE: BIGGEST GAP IN 2 AREAS

The biggest areas that have been over capacity
are Developers and Business Analysts.

Planned targeted hiring of these roles is
expected to continue to reduce overall gap over
time. We will continue to monitor and report.

FIGURE 6.7b Story on a single slide


Let's discuss some of my decision points in Figure 6.7b. I employed a takeaway 
title at the top that goes beyond the data shown to set an expectation for the 
future. I chose a two-sided layout for my send-away story. You'll see additional 
examples of this approach, as it tends to be my go-to structure when you have 
multiple visuals you want to put on a single slide. Two visuals is the magic number 
for me when you need to show more than a single graph, because you can still 
make the graphs large enough to read and also have space to lend context via 
text (for more than this, I recommend breaking into multiple slides). 

In this example, the left-hand side focuses (through color and words) on the
decreasing gap over time. I opted for the stacked bar out of the various iterations
we looked at in Exercise 2.4 because I like how this shows both Capacity and
Demand, with the ability to focus attention on the Unmet Demand. On the
right-hand side, I showed the breakdown by role in a slopegraph. This view of the data,
coupled with strategic use of color and words, makes the Developer and Business 
Analyst clear points of focus. 
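
As a rough illustration of the stacked bar described above, the following Python sketch splits each month's demand into a "met" segment (covered by capacity) and an "unmet" segment (the gap to emphasize). The monthly values, the names `stacked_segments` and `text_bars`, and the text rendering are all assumptions made for illustration; they are not taken from the dashboard data or from any particular charting tool.

```python
# Hypothetical monthly project hours in thousands (illustrative values only,
# not read from Figure 6.7a).
MONTHS = ["Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
CAPACITY = [29, 28, 27, 33, 34, 35, 36, 37, 37]
DEMAND = [46, 48, 56, 47, 45, 44, 42, 41, 40]


def stacked_segments(capacity, demand):
    """Split each period's demand into the two stacked-bar segments."""
    segments = []
    for cap, dem in zip(capacity, demand):
        met = min(cap, dem)        # bottom segment: demand covered by capacity
        unmet = max(dem - cap, 0)  # top segment: the gap we want to highlight
        segments.append((met, unmet))
    return segments


def text_bars(months, segments, scale=2):
    """Render a rough text version of the stacked bars (one line per month)."""
    lines = []
    for month, (met, unmet) in zip(months, segments):
        bar = "#" * (met // scale) + "!" * (unmet // scale)
        lines.append(f"{month} {bar} unmet={unmet}")
    return lines


if __name__ == "__main__":
    for line in text_bars(MONTHS, stacked_segments(CAPACITY, DEMAND)):
        print(line)
```

In a real chart, the "met" segment would typically be drawn in a muted color and the "unmet" segment in the single emphasis color, so that attention lands on the gap.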

I've assumed the specifics in this scenario for purposes of illustration. I framed it more 
as an FYI, rather than any specific call to action, but a different spin on the details 
might necessitate something more pointed. Take note of how the steps we went 
through—putting our graphs into words, considering what to focus on and what to 
omit, graphing the data effectively and thoughtfully using color and words—help us 
transition from an exploratory dashboard to an explanatory data story. 

Exercise 6.8: identify the tension


As we've discussed, tension is a key component of story. Together, we practiced 
identifying the tension and a corresponding resolving action in Exercise 6.3. Here's 
an opportunity for you to do some additional practice on your own. 

Read each of the following scenarios (some may be familiar from their use
elsewhere). For each, first identify the tension. Next, identify an action the audience
can take to resolve the tension you've identified. 

SCENARIO 1: You are the Chief Financial Officer for a national retailer. Your
team of financial analysts just completed a review of Q1 and has identified that
the company is likely to end the fiscal year with a loss of $45 million if operating
expenses and sales follow the latest projections. Because of a recent economic
downturn, an increase in sales is unlikely. Therefore, you believe the projected
loss can only be mitigated by controlling operating expenses and that
management should implement an expense control policy ("expense control initiative
ABC") immediately. You will be reporting the Q1 quarterly results at an upcoming
Board of Directors meeting and are planning your communication (a summary
of financial results in a PPT deck) that you will present to the board with your
recommendation.

SCENARIO 2: Imagine you work for a regional medical group. You and several 
colleagues have just wrapped up an evaluation of Suppliers A, B, C, and D for the 
XYZ Products category. The data shows that historical usage has varied a lot by 
medical facility, with some using primarily Supplier B and others using primarily 
Supplier D (and only limited historical use of Suppliers A and C). You've also found 
that satisfaction is highest across the board for Supplier B. You've analyzed all of 
the data and realized there are significant cost savings in going with a single or 
dual supplier contract. However, either of these will mean changes for some
medical centers relative to their historical supplier mix. You are preparing to present to
the steering committee, where a majority vote will determine the decision. 

SCENARIO 3: Craveberry yogurt is the new product that the food manufacturing
company you are employed by is preparing to launch. The product team on which 
you work decided to do one more round of taste testing to get a final gauge of 
consumer sentiment. You've analyzed the results of the taste test and believe 
there are a couple of changes that should be made—minor things, but they could 
have major impact when it comes to consumer reception in the marketplace. You 
have a meeting with the Head of Product, who will need to decide whether to 
delay launch to allow time to make these changes, or to go to market with
Craveberry as it is.


Exercise 6.9: move from linear path to narrative arc 

Before arranging potential elements of our story along the narrative arc, it can 
sometimes be helpful to start with a more linear view. Specifically, the
chronological path is one that business presentations default to frequently. This makes sense
from the standpoint that this is the order that comes to us most naturally because 
it's the general path that we went through to get from the initial question we set 
out to solve to our proposed outcome or course of action. 

However, the linear or chronological path isn't always the best path on which to 
take our audience. We should be thoughtful in how we organize the information 
through which we lead our audience. Rethinking things in light of the narrative arc 
can be one way to achieve this. Let's look at a linear storyboard we discussed in 
Chapter 1 about university elections and practice reimagining our potential
communication path making use of the narrative arc.

You're a rising university senior serving on the student government council. One 
of the council's goals is to create a positive campus experience by
representing the student body to faculty and administrators by electing representatives
from each undergraduate class. You've served on the council for the past three 
years and are involved in the planning for this year's upcoming elections. Last 
year, student voter turnout for the elections was 30% lower than previous years, 
indicating lower engagement between the student body and the council. You 
and a fellow council member completed benchmarking research of voter turnout 
at other universities and found that universities with the highest voter turnout 
had the most effective student government council at effecting change. You think 
there's opportunity to increase voter turnout at this year's election by building 
awareness of the student government council's mission by doing an advertising 
campaign to the student body. You have an upcoming meeting with the student 
body president and finance committee where you will be presenting your findings 
and recommendation. 

Your ultimate goal is a budget of $1,000 for an advertising campaign to increase
awareness of why the student body should vote in these elections. To this end,
your fellow council member created the following storyboard (Figure 6.9). Review
the storyboard, then complete the following steps. 

FIGURE 6.9 University elections colleague's storyboard


STEP 1: Review the sticky notes in the storyboard in Figure 6.9 and determine how 
you could align them to the components of the narrative arc. Specifically, list which 
points you would cover in each section of the arc: plot, rising action, climax, falling 
action, and ending (you may not use all of the ideas from the original storyboard). 

STEP 2: Write the points you've outlined in Step 1 on your own stickies and 
arrange them in an arc shape. Continue to rearrange, add, remove, and change 
components of your story as you do this, making assumptions as needed. 


STEP 3: Did the process of physically arranging ideas along an arc change your 
approach? Write a paragraph or two about your process and any learnings. Is this 
a strategy you can envision employing in the future? Why or why not? 

Exercise 6.10: build a narrative arc

Let's make use of the narrative arc again. This time, we'll skip the storyboarding 
step and go straight from the scenario to creating our story arc using an example 
from Exercise 6.8 that was part of our tension identification practice. Refresh your 
memory by reading the following, then complete the subsequent steps. 

Craveberry is the new yogurt product that your food manufacturing employer is 
preparing to launch. The product team on which you work decided to do an
additional round of taste testing to get a final gauge of consumer sentiment before
going to market with the product. The taste test collects data on what participants 
like or don't like across a number of dimensions: sweetness, size, amount of fruit, 
amount of yogurt, and thickness. You've analyzed the results of the taste test 
and believe there are a couple of changes that should be made: minor things
that have the potential for major impact when it comes to consumer reception 
in the marketplace. Specifically, your recommendation will be to keep sweetness 
and size consistent. But people think the product is too thick and has too much 
fruit. Therefore, you recommend reducing the amount of fruit and increasing the 
amount of yogurt, which is expected to reduce the overall thickness. You have 
a meeting with the Head of Product, who will need to decide whether to delay 
launch to allow time to make these changes, or to go to market with Craveberry 
as it is. 

STEP 1: Get some sticky notes and write down the components (pieces of
content) that you believe will be part of your Craveberry story.

STEP 2: Arrange the stickies you've created in the shape of an arc, aligning your 
various ideas to the components of the arc: plot, rising action, climax, falling
action, and ending. Feel free to add, remove, or change things as necessary to meet
your needs, making assumptions for the purpose of the exercise. 

STEP 3: Compare this process to the one you experienced in Exercise 6.9. Did you
find it easier to start with a storyboard or blank slate when it came to planning 
the components of your data story along the narrative arc? What are the resulting 
implications for the planning process you will undertake in the future? Write a 
paragraph or two outlining your observations and learnings. 

Exercise 6.11: evolve from report to story

Dashboards and regular reporting (weekly, monthly, quarterly)—we can use these 
tools as one way to explore our data and figure out what could be interesting, worth 
highlighting, or digging into further. There can also be great value from a self-service 
standpoint of sharing reports with end users, who can then use them to answer their 
many individual questions, freeing up your time for more interesting analysis. 

But too often, we share dashboards or reports meant for exploration when really 
we should be taking it a step further, making it clear to our audience what they 
should focus on and what they should do with the information we share. 


Examine Figure 6.11, which shows a page from a monthly report on ticket volume 
and related metrics. Then complete the following steps. 

FIGURE 6.11 Key Metrics

STEP 1: Let's start by practicing putting it into words. Write a sentence describing
a takeaway for each graph in the report shown in Figure 6.11. 

STEP 2: Do we need all of this data? It may be important to look at all of these 
things as we are exploring the data, but not all of the data is necessarily equally 
interesting when it comes to communicating it to our audience. Imagine you need 
to tell a story with this data: which parts of the report would you focus on and 
which (if any) would you omit? 

STEP 3: Download the data and create graphs and/or slides to tell a visual
story with the elements you selected to include in Step 2. How would you show
the data? How will you incorporate words? Create your preferred visuals. Decide 
whether you'll present live or send the information off to be consumed on its own. 
Optimize your approach for the scenario you've chosen, making assumptions as 
necessary for the purpose of this exercise. 

Exercise 6.12: form a pithy, repeatable phrase

Repetition helps form a bridge from our short-term memory to our long-term 
memory. We can make use of this in the actual words we use to communicate with 
data by articulating our main point in a pithy, repeatable phrase. 


Identify a project you are working on to communicate with data for explanatory 
purposes. Have you crafted your Big Idea? If not, refer back to Exercise 1.20 and 
complete it. Next, turn your Big Idea into a pithy, repeatable phrase. This can help 
you get clear on your goal when communicating, and can also be incorporated 
into your actual materials to help increase memorability for your audience. The 
pithy, repeatable phrase is short and catchy and it may incorporate alliteration. It 
need not be cute. It does need to be memorable. (If you crave an example—allow 
me to foreshadow—you'll get a chance to practice and see this idea employed in 
Exercises and Solutions 7.4 and 7.6.) 


In a live presentation, you can start with the pithy, repeatable phrase. You could 
also end with it, or you might weave it in different ways over the course of your 
presentation—so that when your audience leaves the room, they've heard it a few 
times. This means they are both more likely to remember and be able to repeat it. 

When you are sending something around (not presenting live), the pithy,
repeatable phrase can be written in words. You might opt to make it the title or subtitle
of your deck. Or use it for the takeaway title of an important slide. Or put these 
words on the final slide your audience sees. In some situations, it may make sense 
to combine these ideas. Consider how you can use repetition in your words— 
whether spoken or written—to make your main point clear and memorable. 

The next time you communicate with data, contemplate how you can make use of 
a pithy, repeatable phrase. 

Exercise 6.13: what's the story?

We ask ourselves and each other versions of this question frequently when
looking at data: what's the story? When we ponder this, we generally don't mean
story at all; rather, we are trying to understand the key takeaway or point. Clearly
answering the question, "So what?" is the minimum level of "story" that must
exist any time we communicate with data for explanatory purposes. Too often,
we leave it to our audience to figure this out on their own, and good work and
improved understanding are missed as a result.

In the past, I've differentiated how we think about story as it relates to our data 
communications into two types. I refer to them as story with a lowercase "s" and 
Story with a capital "S." Let's discuss each of these, including tips for how to think
about and make use of them at work. 

story with a lowercase "s" 

For every graph and slide you create, ask yourself: "What is the main point?" Put 
it into words, as we practiced in Exercises 6.2, 6.7, 6.11, 7.5, and 7.6. Once you've
articulated your point, take intentional steps to make that point clear to your
audience. Employ a takeaway slide or graph title to set expectations for your audience
(for a reminder of what this is or additional practice, refer back to Exercises 6.1 
and 6.7). Focus attention as we practiced in Chapter 4. Use words—either in your 
spoken narrative or written physically on the page—to explain to your audience 
what they should see and what it means. 

Never leave your audience wondering, "So what?" Answer this question clearly 
for them! 

Story with a capital "S" 

Clearly articulating the primary takeaway is a step in the right direction, but there 
is a whole other level of story that we can employ: Story with a capital "S." This is 
Story in the traditional sense. It starts with a plot, then tension is introduced. The 
tension builds to a point of climax. It is followed by a falling action, which brings 
us to the resolution. Well-told stories get our attention, stick with us, and can be
recalled and retold. We can make use of Story strategically when we
communicate with data.

The tool I recommend for structuring your Story is the narrative arc. When we
conceive our data story in the shape of an arc, it forces a couple of things. First, to
create the rise, we must identify tension. As a reminder, this is not the tension that
exists for us, but rather the tension that exists for our audience. It's not about
making up tension; if tension didn't exist, you'd have nothing to communicate about
in the first place. Viewing our path as an arc also encourages us to think about how
one idea or component relates to the next. This is easy to skip when we arrange
things linearly, and the arc can help us identify where we might need additional content or
transitions to ensure smooth flow. The arc forces us to think about the path on which
we take our audience. Perhaps more important than any of this, the narrative arc
encourages us to think about things from our audience's perspective in a way that
linear storyboarding simply doesn't facilitate. This is the most
important shift I see happen when people move from how we typically communicate
in a business setting to using the narrative arc and Story: with Story, we must step
outside ourselves and think critically about what will work for our audience.

Consider how you can use story with a lowercase "s" and Story with a capital "S" 
the next time you communicate with data. When it comes to the latter, you'll find 
more specific steps for making use of the narrative arc in the next exercise. 

Exercise 6.14: employ the narrative arc

The stories we encounter in books, movies, and plays typically follow a path: the 
narrative arc. We can use the narrative arc to our advantage when telling stories 
with data as well. 

Figure 6.14 shows the general narrative arc. 


[Narrative arc diagram, left to right: PLOT, RISING ACTION, CLIMAX (peak), FALLING ACTION, ENDING]


FIGURE 6.14 The narrative arc 

Let's review each component of the arc, with some related thoughts and
questions that you can use when communicating with data.

• Plot: What does your audience need to know in order to be in the right frame 
of mind for what you will be asking of them? Identify the tacit knowledge you 
have that would be helpful to communicate explicitly to ensure people are 
working from the same set of assumptions or understanding of the situation. 

• Rising action: What tension exists for your audience? How can you bring that
to light and build it, to the level appropriate given the circumstances, for
your audience?

• Climax: What is the maximum point of tension? This isn't tension for you, but 
rather tension for your audience. Think back to the Big Idea and conveying 
what is at stake. What does your audience care about? How can you utilize 
that to get and maintain their attention? 

• Falling action: This is perhaps the fuzziest of the components when it comes
to application in a business setting. The main purpose is so that we don't go 
abruptly from the highest point of tension—the climax—to the ending. The 
falling action is a buffer to ease this transition. In our data stories, it can take 
the form of additional detail or further breakdown (here's how the tension 
plays out by product or region), or could be potential options you've weighed, 
solutions you may employ, or discussion you'd like to facilitate among your 
audience. 

• Ending: This is the resolution, the call to action. The ending is what your
audience can do to resolve the tension that you've brought to light. Note that it
isn't typically so simple as "We found X; therefore you should do Y." Our data 
stories are often more nuanced than that. This ending could be a conversation 
you want to drive, options to choose from, or perhaps even input you need 
from your audience to fully flesh out your story. In any case, identify the action 
you need your audience to take and how to make it clear and compelling. 

From my perspective, in a business setting, it's less important that your story
follow the arc in this order exactly (we see stories in everyday life veer from this path
often anyway through flashbacks, foreshadowing, and so on) and more important
that each of these pieces is present. In particular, I find that when we go with the
typical linear path for communicating information in a business setting, tension 
and climax can be missed entirely. As we've discussed, these are critical elements 
of story. By not bringing them to light, we are doing our data and the stories we 
want to tell with it a disservice. 

That said, it can sometimes be difficult to make the jump from all we know about 
a given scenario to the arc. Storyboarding can be a good interim step. Refer to 
Exercise 1.24 for instructions on storyboarding. Once you have your storyboard, 
go through the process of arranging components along the arc. This is simply
another tool in your communications toolbelt that you can use to your advantage in
some situations. I find when I have something important to communicate, going 
through the process of arranging the pieces along the arc can highlight when I 
might be missing something to make the pieces fit together or not fully thinking 
about my audience, the tension, and how they can help resolve it. 

Consider how you can use the narrative arc when communicating to tell a data story 
that will get your audience's attention, build credibility, and inspire them to act! 

Exercise 6.15: let's discuss

Consider the following questions related to Chapter 6 lessons and exercises.
Discuss with a partner or group.

1. What is a takeaway title? How does it differ from a descriptive title? When, why, 
and where would you choose to use a takeaway title in your communications? 

2. What role does tension play when communicating with data? How can you 
identify the tension in a given situation? Reflect on a current project: what 
is the tension? How might you incorporate this tension in your data story? 

3. What are the components of the narrative arc? Can you list them? When and 
how can you make use of the narrative arc when communicating with data? 
Are any components of the arc fuzzy or confusing? Are there any you'd like 
to talk about more? 

4. How should we order the various components of our data-driven stories? 
What should we consider when determining how to organize the information? 

5. Do you anticipate resistance from your audience or other challenges using 
story to communicate in the ways outlined in SWD and illustrated in this 
chapter? How will you deal with it? When does it not make sense to use story 
to communicate? 

6. How can we use repetition strategically in our data communications? Why 
would we want to do this? 

7. What is different when you present live to your audience (in a meeting or 
presentation) compared to when you send something off to be consumed 
on its own? How should the materials you create differ? What strategies can 
you employ in each of these scenarios to help ensure success? 

8. What is one specific goal you will set for yourself or your team related to the 
strategies outlined in this chapter? How can you hold yourself (or your team) 
accountable to this? Who will you turn to for feedback? 

While prior chapters have provided a piecemeal focus on the given lesson, this
chapter takes a more comprehensive view of the entire storytelling with data
process. Real-world based scenarios and related data visualizations are introduced and
paired with specific questions to consider and solve. These are followed by step-by- 
step illustrations that give full insight into my thought process and design decisions. 

I encounter many examples of data communications through our workshops.
Clients share their work ahead of time and we use this as the basis of discussion and
practice. These cross many topics and industries and there is something to be 
learned from each and every one. The process of creating data visualization
makeovers from select examples to highlight storytelling with data lessons has been
key to honing my own and my team's skills for critiquing, remaking, and sharing 
and discussing examples. In this section, you'll get an opportunity to practice just 
like the storytelling with data team—then you'll be walked through our solution as 
if you were a participant in one of our hands-on workshops. 

Though the lessons in SWD and here could be taken as step-by-step instruction, 
my typical approach for moving from data to data story is more holistic, which is 
how you'll see it addressed in the forthcoming examples. Rather than go through all 
parts of the process each time, the various examples are used to highlight different 
components, exposing you to varied challenges and potential solutions. We will 
start out with some simple graph and slide redesigns and get increasingly
comprehensive as we move through the case studies presented and solved in this chapter.

Let's practice! 

Before we do, we'll review the main lessons we've covered so far. 

Exercise 7.1: new advertiser revenue

Imagine you are an analyst at a digital marketing company. A new feature was 
rolled out in 2015—let's call it Feature Z—that allows your company's clients to 
create better ads and will introduce a new revenue stream for your platform. The 
challenge is that Feature Z has a steep learning curve, so there's been some
difficulty getting clients to utilize it. Overall, you have seen improvement over time,
both in terms of clients using Feature Z and increased revenue from it. At a recent 
meeting when discussing this topic, the head of client support raised a question 
about what Feature Z adoption looks like for new advertisers specifically—those 
creating an advertisement on your platform for the first time. No one has sliced 
the data to look at this before, so you're working with a colleague to answer this 
question. 

Your colleague put together the heatmap shown in Figure 7.1a. Spend a few
moments studying it, then complete the following steps.

FIGURE 7.1a Advertisers are getting more sophisticated sooner


STEP 1: It's easy to jump to what's "wrong" when faced with someone else's data 
viz; let's pause first and reflect on positive points to share as feedback. What do 
you like about the current visual? Write a sentence or two. 


STEP 2: What is not ideal in Figure 7.1a? Make a list. 


STEP 3: How would you show this data? Download the data and iterate in the 
tool of your choice to create your preferred view. 

STEP 4: What is the tension in this scenario? What action do you want your
audience to take to resolve this tension?

STEP 5: You've been asked to provide a single slide that tells the story. Create 
this slide in the tool of your choice, making assumptions as necessary for the
purpose of this exercise.
Solution 7.1: new advertiser revenue

STEP 1: There are a couple of things I like in this visual. Words are generally used 
well. The takeaway title at the top does a nice job setting up the story and starting
to address the question posed. Axes are titled directly. I also like that there is
annotation directly on the graph (in the grey text box on the right) that supports 
the takeaway title and also is connected directly to the data that is described so 
I don't have to search for it. That said, I don't love the arrows for creating this 
connection because they look messy and cover up some of the data. Oops, I've 
already jumped to what I'd do differently! Let's address that next. 

STEP 2: There are a number of things that could be improved in this visual. The 
following are three primary aspects that are not ideal from my perspective. 

• The table feels like work. I find the tabular data difficult to wrap my head 
around. The heatmapping helps, but it still requires effort to figure out 
what we're meant to see. 

• We've employed a problematic color scheme. The red/green color 
scheme will be an issue for colorblind individuals in our audience. Beyond
that, the red and green are competing for my attention, making it
difficult to focus. 

• There is lack of alignment. The various elements on the page aren't 
aligned. We see a mix of left-aligned, centered, and right-aligned text 
and numbers, without an apparent reason why. This creates an overall 
disorganized feel. 

STEP 3: Iterating through multiple ways of graphing the data will likely be necessary
to observe how variant views allow us to more or less easily see different
things. We have time in a couple of ways in this example: by quarter for the first
time an ad was created and by advertiser age. This means we could graph this
data as lines in two totally distinct ways. I'll start by creating a couple of quick and
dirty graphs of the data (these are simply default charts; I'm not worrying at all 
about formatting at this step). See Figure 7.1b. 
FIGURE 7.1b Quick and dirty views of data

Let's consider what the visuals in Figure 7.1b allow us to see. In both cases, the 
y-axis represents the percent of total revenue driven by Feature Z. On the left-hand
side, my x-axis is the date when the first ad was created. Each line represents
advertisers of the given age group. This allows us to see that those in the first 
quarter (0) are less successful (the dark blue line for 0 quarters is at the bottom of 
the graph, where percent of revenue is lowest), though improving (the line for 0 
increases as you move from left to right). Revenue goes up generally as age increases
(the lines as we move upwards in age are increasingly high on the graph),
with those in their 15th quarter appearing highest on the graph (we don't see a 
line for 16 quarters of tenure since there's only a single data point and you need 
two points to make a line). If you're reading this and thinking, Wow, this sounds 
complicated—you are correct. Let's shift our attention to the second graph. 

The right-hand side plots advertiser age on the x-axis. Each line represents the 
quarter in which the first ad was created. Lines going upward to the right illustrate 
sophistication increasing with advertiser age. Lines moving upward on the graph 
show sophistication increasing earlier, with more recent quarters at the top. Notice
how much simpler that was to put into words compared to the left-hand side!
This is going to be a good general view of the data given what we want to illustrate. 
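
The two views in Figure 7.1b are the same records grouped two different ways. Here is a minimal sketch of that regrouping in Python, assuming the data arrives as one record per cohort quarter and advertiser age. The record layout, the `series_by` helper, and every number below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical long-format records: (cohort quarter, advertiser age in quarters,
# share of revenue from Feature Z). The values are made up for illustration.
records = [
    ("2015-Q1", 0, 0.02), ("2015-Q1", 1, 0.05), ("2015-Q1", 2, 0.09),
    ("2015-Q2", 0, 0.03), ("2015-Q2", 1, 0.07),
    ("2015-Q3", 0, 0.05),
]


def series_by(records, line_key, x_key):
    """Group records into one line per `line_key`, with points sorted by `x_key`.

    Both keys are functions that pull a field out of a record.
    """
    lines = defaultdict(list)
    for rec in records:
        lines[line_key(rec)].append((x_key(rec), rec[2]))
    return {name: sorted(points) for name, points in lines.items()}


# Left-hand view: one line per advertiser age, x-axis is the cohort quarter.
by_age = series_by(records, line_key=lambda r: r[1], x_key=lambda r: r[0])

# Right-hand view: one line per cohort quarter, x-axis is advertiser age.
by_cohort = series_by(records, line_key=lambda r: r[0], x_key=lambda r: r[1])
```

In this toy data the oldest age group ends up with a single point in `by_age`, which mirrors the note above about the 16-quarter group: one point cannot be drawn as a line.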

That said, this is still a lot of data to process. Do we need it all? Perhaps we could 
simplify by showing less. One way to do so would be to show fewer quarters of 
data. That said, I don't necessarily want to narrow our time window, since I'd like 
to be able to compare how things looked when Feature Z was introduced in 2015 
and our recent data. But that still leaves me a couple of options. Given that the 
most recent data point is Q1 2019, I could elect to show just the Q1 line for each year.
Another approach could be to roll this data into annual cohorts. Look back to the 
right-hand side of Figure 7.1b and imagine we have just 5 lines of data (2015, 
2016, 2017, 2018, 2019). Aggregating in this way would simplify things and allow 
us to clearly make our point: we're seeing more revenue sooner as advertiser sophistication
gets better with both time (increasing as we move up the graph) and
age (increasing rightwards). Bingo! 
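
Rolling quarterly cohorts into annual ones could look something like the sketch below. It assumes we have the underlying revenue amounts rather than only the percentages, so that the annual share is computed from summed revenue instead of averaging quarterly percentages. The record layout, the function name `annual_cohort_share`, and the figures are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (cohort quarter, age in quarters,
# Feature Z revenue, total revenue). All numbers are made up.
records = [
    ("2015-Q1", 0, 10.0, 500.0),
    ("2015-Q2", 0, 20.0, 500.0),
    ("2016-Q1", 0, 60.0, 600.0),
    ("2016-Q3", 0, 30.0, 400.0),
    ("2015-Q1", 1, 50.0, 500.0),
]


def annual_cohort_share(records):
    """Return {(cohort year, age): Feature Z share of revenue}.

    Revenue is summed before dividing, so larger quarters carry more weight
    than they would in a simple average of quarterly percentages.
    """
    feature = defaultdict(float)
    total = defaultdict(float)
    for quarter, age, feature_rev, total_rev in records:
        key = (int(quarter[:4]), age)  # "2015-Q2" -> 2015
        feature[key] += feature_rev
        total[key] += total_rev
    return {key: feature[key] / total[key] for key in total if total[key] > 0}


shares = annual_cohort_share(records)
```

Each cohort year then becomes one line, with advertiser age on the x-axis, which is the five-line view described above.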

STEP 4: Let's step back from the data for a moment to identify the tension and resolution.
I'd characterize the tension as: while we're generally seeing improvement in
adoption of and revenue from Feature Z, we don't know how this plays out for our
newbie advertisers. Are things okay? Are there issues we need to address? 

The resolution is that things look good—no immediate action is needed. I often 
suggest that if we can't clearly articulate the action we want our audience to take, 
we should revisit the need to communicate in the first place. But here, our audience 
posed a question and though no action is needed, this isn't a reason not to answer 
it! Still, let's be clear on what we need our audience to do: they should be aware of 
this and they don't need to do anything right now. We can take the action to continue
to monitor things from this perspective and alert them if this changes.

STEP 5: Figure 7.1c shows my single-slide story. Take a moment to study it and
compare to what you created. Are there similarities? Where are there differences? 
Note what works well in each approach. 
The proportion of revenue from Feature Z is higher sooner:

The proportion of revenue from Feature Z generally
increases with advertiser tenure. By 16 quarters of
advertising, 50% of revenue came from Feature Z
for those first creating an ad in 2015. Given the
increase in sophistication over time, this may end
up being even higher for more recent cohorts.

FIGURE 7.1c Sophistication increasing with time & age

We've reviewed a number of lessons over the course of this solution. Always ask 
yourself: do you need all the data? Determine what you want your audience to 
see, then select a visual that will facilitate that, iterating as needed to identify 
an effective view. Design thoughtfully, aligning elements to create structure, using
color sparingly to direct attention, and titling and annotating effectively with
words to help the data make sense. 
Exercise 7.2: sales channel update


The following slide (Figure 7.2a) shows unit channel sales over time for a given 
product. Familiarize yourself with the details, then complete the following steps 
on your own or with a partner. 
FIGURE 7.2a Sales channel update

STEP 1: Let's start with the positive: what do you like about this slide? 

STEP 2: What questions can we answer with this graph? Where specifically do we 
look in the data for answers to these questions? How effectively are these questions
answered? Write a couple of sentences outlining your thoughts.

STEP 3: What changes would you recommend based on the lessons we've covered?
Write a few sentences or make a list outlining specific points of feedback
and how you would resolve them.

STEP 4: Consider how your approach would vary if you were (1) presenting this 
data live in a meeting and (2) sending it to your audience to be consumed on its 
own. How would the way you'd tackle this differ? Write a few sentences explaining
your thoughts. To take it a step further, redesign this visual for these different
use cases in the tool of your choice. 
Solution 7.2: sales channel update

For me, this is a situation where we are trying to answer too many questions in a 
single graph. In doing so, we don't answer any single question effectively. Rather 
than pack a lot into one graph, we will be better off allowing ourselves to have 
multiple visuals. 

STEP 1: What do I like about the original slide? I like that someone outlined 
some specific points of interest via the bulleted text below the graph. The general 
design of the graph is also pretty clean; there isn't a lot of clutter to distract from 
the data. 

STEP 2: We can answer a couple of different primary queries with this graph: 
how have unit sales changed over time? How has the composition of sales across 
channels changed over time? We're meant to see the former by comparing the 
tops of the bars and the latter by comparing the pieces within the stack. Stacked 
bars are challenging, though, because when you stack things that are changing 
on top of other things that are changing, it becomes very difficult to see what is 
happening. The second point outlined via bullets says there was a decision made 
to shift sales from retail to partners. Does the data show that happening? Is this a 
success story or a call for action? It's tough to tell! 

STEP 3: I have three major points of feedback that I will address in my makeover 
of this visual. Let's talk through each of these. 

Use multiple graphs. The biggest change I will make is to use multiple graphs. I 
sometimes think of the stacked bar chart like a Swiss Army knife. You can do many 
things with it and sometimes constraints may necessitate its use. But you can't do 
any of these tasks quite as well as if you had the dedicated tool. Sure, the scissors 
on a Swiss Army knife work well enough to cut a loose thread, but for anything 
more I'd much rather have a pair of scissors. Instead of the stacked bar, I'll use 
two different graphs that each directly answer the questions outlined in Step 2. I'll 
walk you through the specifics in Step 4. 

Tie text visually to data. In terms of additional changes, as mentioned in Step 1,
I like that someone looked at this data and outlined takeaways. The challenge,
though, is that when I read the text at the bottom of the slide, I have to spend 
time thinking and searching to figure out where to look in the data for evidence 
of what is being said. I'll want to solve for this—when someone reads the text, I 
want them to know where to look in the data. When someone sees the data, I 
want them to know where to look in the text for related context or takeaways. To 
connect the text and data, we should think back to the Gestalt principles. We can 
use proximity, putting the text close to the data it describes. We can use connection
and physically draw a line between the text and data. We can use similarity,
making the text the same color as the data it describes. I'll employ aspects of each
of these approaches in my solution. 

Clearly differentiate forecast data. If I take the time to read through all of the 
text, the final bullet raises something perhaps unexpected—not all of this data is 
real. The final point, 2020, is a forecast. But there's nothing done to the design of 
the data to indicate this to us. I'll want to change that and clearly indicate which 
points are actual data and which are forecast. 

STEP 4: My makeover will address the points of feedback raised in Step 3. Let's 
first tackle the initial question—how have total unit sales changed over time? See 
Figure 7.2b. 


[Graph: Total units sold, by fiscal year 2012 to 2020; 2020 is projected]

FIGURE 7.2b Visualize units sold over time with a line

I moved from bars to a line so the connection between points would allow us to 
more easily see the trend. I chose to omit the axis and instead labeled a few of 
the data points; why I chose the ones I did will become clear momentarily. I made 
a visual distinction between the actual data (solid line, filled points) and projected 
data (dotted line, unfilled point) and also added the descriptor text "Projected" 
to the x-axis tied through proximity to the 2020 label. 
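
For readers who build their graphs in code, the actual-versus-projected distinction might be sketched as follows in Python. This assumes matplotlib is available for the drawing step; the helper names `split_actual_projected` and `plot_units` are hypothetical, and any values you pass in are your own, since the book's dataset is not reproduced here.

```python
def split_actual_projected(years, values, first_projected_year):
    """Split a series into an actual segment and a projected segment.

    The projected segment starts at the last actual point so the dotted
    line connects to the solid one.
    """
    cut = years.index(first_projected_year)
    if cut == 0:
        raise ValueError("need at least one actual data point")
    actual = (years[:cut], values[:cut])
    projected = (years[cut - 1:], values[cut - 1:])
    return actual, projected


def plot_units(years, values, first_projected_year):
    """Draw actuals as a solid line with filled points and the projection
    as a dotted line with an unfilled point."""
    # Imported here so the splitting helper above has no plotting dependency.
    import matplotlib.pyplot as plt

    (ay, av), (py, pv) = split_actual_projected(years, values, first_projected_year)
    fig, ax = plt.subplots()
    # Projection first, so the solid line and its markers are drawn on top.
    ax.plot(py, pv, color="tab:blue", linestyle=":")
    ax.plot(ay, av, color="tab:blue", linestyle="-", marker="o")
    # Only the projected points get an unfilled marker.
    ax.plot(py[1:], pv[1:], color="tab:blue", linestyle="none",
            marker="o", markerfacecolor="white")
    ax.set_xticks(years)
    ax.set_xticklabels(
        [f"{y}\nprojected" if y >= first_projected_year else str(y) for y in years]
    )
    ax.set_title("Total units sold", loc="left")
    return fig
```

Calling `plot_units(years, values, 2020)` with nine yearly values would return a figure in which only the 2020 segment is dotted and only the 2020 tick carries the "projected" descriptor, placed directly under the year so proximity ties the two together.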

If I will be presenting this information in a live setting, that opens up additional opportunities.
Any time we are looking at data over time, we have a natural construct
for storytelling: the chronological story. In a live setting, I can build the graph piece
by piece, talking through relative context as I do. See the following (Figures 7.2c to
7.2r) for an illustration of how I might do this, paired with my written voiceover.
Today, we'll be looking at unit sales over time. I'll start back in 2012, we'll look at
actual data through 2019 and then our latest projection for 2020. (Figure 7.2c)

FIGURE 7.2c In a live progression, first set up graph


Our product hit the market in 2012. Sales that first year amounted to twenty-two
and a half thousand units, which we were very happy with against our initial target
of 18,000 units. (Figure 7.2d)

FIGURE 7.2d Live progression
Sales increased slightly in 2013, to just over 23,000 units sold. (Figure 7.2e)

FIGURE 7.2e Live progression


But then in 2014, we encountered production issues. As a result, we weren't able 
to keep up with demand. Units sold plummeted. (Figure 7.2f) 

FIGURE 7.2f Live progression
We quickly recovered, fixing the issues and hitting nearly 24,000 in unit sales in
2015. (Figure 7.2g)

FIGURE 7.2g Live progression


Sales continued to increase through 2016 and 2017. (Figure 7.2h) 

FIGURE 7.2h Live progression
In 2017, we made the decision to start shifting sales from retail to our partner
channel. We saw sales in 2018 dip as a result of this. (Figure 7.2i)

FIGURE 7.2i Live progression


We recovered from this dip in 2019. (Figure 7.2j) 

FIGURE 7.2j Live progression
We expect continued increasing sales in 2020. (Figure 7.2k)

FIGURE 7.2k Live progression


If I weren't presenting live, I could annotate this context directly on the graph so 
that my audience processing it on their own would get the same sort of story I'd 
walk my audience through in a live setting. See Figure 7.2l.

[Graph: Total units sold, by fiscal year 2012 to 2020 (2020 projected); labeled points include 22.5K, 26.8K, and 26.5K. Annotations on the graph:]

2012: launch beating goal of 18K units sold.

2014: production issues led to not enough units being made and resulting dip in unit sales that year.

RECENT YEARS: decision made in 2017 to shift sales from Retail to Partner, leading to drop in total unit sales in 2018, bouncing mostly back in 2019 and expected continued increase in 2020.

FIGURE 7.2l Annotated line graph

We'll look momentarily at how we could integrate a version of this graph into a 
slide. But before we get there, let's determine how we can answer the relative 
composition question in a live setting: with a 100% stacked bar. The 100% stacked 
bar does face some of the same challenges as the typical stacked bar, in that the 
middle segments are harder to compare. However, we also get some benefit.
With a consistent baseline both along the bottom and top of the graph, there are
two data series that our audience can more easily compare over time. Depending
on what we need to highlight and if we are smart about how we order the data, we 
can actually make this work quite well. Let's tune back into my live presentation: 

Let's look next at the composition of sales across channels. (Figure 7.2m) 


[Graph: Units sold by channel, by fiscal year 2012 to 2020 (2020 projected)]

FIGURE 7.2m Another view: channel breakdown


The retail channel has decreased as a proportion of total over time. (Figure 7.2n)

FIGURE 7.2n Focus on retail
E-commerce has increased marginally since launch, but has made up a consistent
proportion of total sales in recent years. (Figure 7.2o)

FIGURE 7.2o Focus on e-commerce


Direct mail is tiny, has always been tiny, and will stay tiny. (Figure 7.2p) 


FIGURE 7.2p Focus on direct mail
Partner sales have increased as a percent of total. (Figure 7.2q)

FIGURE 7.2q Focus on partner


Most noteworthy is the change we've seen in composition of sales by channel 
over time. In particular, since the decision was made in 2017 to shift sales from
retail to partner, we've seen that change happen. Retail is making up a decreasing
proportion of total sales, while the partner channel is making up an increasing 
proportion of sales over time. We expect this will continue in 2020. (Figure 7.2r) 


FIGURE 7.2r The desired shift in channels has happened: success!
This is a success story! If we need to pull that together in a way that can stand on
its own to be sent around, I would opt for a single slide with takeaway titles, clear 
structure, and more written words to lend context. See Figure 7.2s. 


Total sales increasing as channels shift

2020 EXPECT INCREASE IN SALES

The 2012 launch yielded 22.5K units sold, beating
initial 18K goal by 25%. In 2014, production issues
led to not enough units being made and resulting in a
22% drop in annual unit sales. Unit sales increased
through 2017, when the decision was made to shift
channel sales from retail to partners. This led to a
drop in 2018, bouncing mostly back in 2019, and
expected to continue to increase in 2020.

[Graph: Total units sold, by fiscal year 2012 to 2020 (2020 projected); labeled points include 26.8K, 26.5K, and 24.6K. Annotation: 2017 decision to shift sales from Retail to Partner]

CHANNEL SHIFT OVER TIME

The relative composition of sales channels has
shifted markedly over time. Unit sales from the
Retail channel, which historically accounted for
60% of total sales, has decreased to 38% in 2019
due to intentional shift towards the Partner channel,
where contribution to total unit sales has increased
from less than 20% in our early years to 30% in
2019. We expect these trends to continue.

[Graph: Units sold by channel, by fiscal year 2012 to 2020 (2020 projected)]

FIGURE 7.2s Single slide for distribution
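
Percentages quoted in slide text, such as beating the 18K goal by 25%, are worth checking before they go out. A minimal sketch of that arithmetic in Python follows; the function name `pct_change` is hypothetical, and the two inputs are the launch figures quoted in the slide.

```python
def pct_change(new, old):
    """Relative change from `old` to `new`, as a fraction (0.25 means +25%)."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old


# Figures quoted in the slide text: 22.5K units sold against an 18K goal.
vs_goal = pct_change(22_500, 18_000)
label = f"{vs_goal:+.0%}"  # formatted for use in slide text
```

The same helper applies to year-over-year statements such as the 2014 drop, provided the underlying unit counts are available.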

By allowing ourselves to have more than one graph, we can effectively answer 
multiple questions. Note also the various ways in which text is used in the preceding
visual, and the numerous ways words are visually tied to the data. When my
audience reads the text, they know where to look in the data for the important
things and vice versa. Not only will this be a more pleasant experience for my audience
to process this data, but we can also get much more information out of it!
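
The 100% stacked bar used in this solution depends on two preparation steps: converting unit counts to shares of each year's total, and choosing a stack order that puts the series we most want compared on the bottom and top baselines. Here is a minimal sketch in Python. The unit counts are made up (chosen so the shares echo the percentages quoted in the slide), and the names `STACK_ORDER` and `channel_shares` are hypothetical.

```python
# Hypothetical unit sales by channel for two years (made-up numbers).
units = {
    2012: {"Retail": 600, "Partner": 150, "E-commerce": 200, "Direct Mail": 50},
    2019: {"Retail": 380, "Partner": 300, "E-commerce": 270, "Direct Mail": 50},
}

# Stack order, bottom to top. The first and last series sit on the fixed
# 0% and 100% baselines, so they are the easiest to compare over time.
STACK_ORDER = ["Retail", "E-commerce", "Direct Mail", "Partner"]


def channel_shares(year_units, order=STACK_ORDER):
    """Return [(channel, share of that year's total)] in stacking order."""
    total = sum(year_units.values())
    if total == 0:
        raise ValueError("no units sold in this year")
    return [(channel, year_units.get(channel, 0) / total) for channel in order]


shares_2012 = channel_shares(units[2012])
shares_2019 = channel_shares(units[2019])
```

Placing Retail first and Partner last is a design choice for this story: those are the two channels whose shift we want the audience to see, so they get the consistent baselines.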
Exercise 7.3: model performance

You work at a large national bank managing a team of statisticians. One of your 
employees shares the following graph (Figure 7.3a) with you during their weekly 
one-on-one and asks for your feedback. Spend a moment analyzing it, then complete
the following steps.


Graph title: FIGURE A: Model Performance by LTV
Subtitle: V3.7 FIX3ED0 TRACKING ON ONBS DATA AS OF 31AUG19. out_of_time=y
X-axis title: ltv_bin
Legend: actual, model, UPB
Footnote: out_of_time=y IF eff_date >= '31Aug17'

FIGURE 7.3a Model performance by LTV

STEP 1: What questions would you ask about this data? Make a list. 

STEP 2: What feedback would you give based on the lessons we've covered? 
Outline your thoughts, focusing not just on what you would recommend changing,
but also why. It's by grounding our feedback in underlying principles that we
help improve not just a single graph, but also deeper understanding that can lead 
to better data visualization in the future. 

STEP 3: How would you recommend presenting this? What is the story and how 
can it be brought to life? Make an assumption about whether this information will 
be presented live or sent around, and outline your recommended plan of attack in 
light of this assumption. Take things a step further by creating your recommended 
communication in the tool of your choice. 
Solution 7.3: model performance

STEP 1: This is a difficult one! I have so many questions about the graph that it's 
hard to step back and consider what the data is showing. I would ask about the 
language used: what do the acronyms mean and what do the axes represent? 
Which data is meant to be read on the secondary y-axis on the right-hand side of 
the graph? Why have these particular choices for line style been made? What is 
the big red box meant to highlight? 

STEP 2: After getting answers to my questions raised in Step 1, I would offer the 
following feedback. 

Use more approachable language. This looks to me like output from statistical 
programming software (SAS or similar). If you are a statistician and you are communicating
to your colleagues who are also statisticians, this is totally fine. If you
are communicating to anyone else on the planet, you need to turn this into accessible
language. Rather than put things like "vol_prepay_rt" (y-axis title in Figure
7.3a), we should translate it into voluntary prepayment rate. This is the proportion 
of people who are paying off their loan before it is due. The only reason I have any 
idea what's going on in this graph is because I used to work in Credit Risk Management,
so I have enough banking subject matter expertise to make sense of it.

On the topic of comprehensible language, you should also spell out any acronyms. 
If someone in your audience doesn't know what an acronym means, they will 
usually be too embarrassed to ask—or they may make an incorrect assumption. 
In that event, you've lost the ability to fully communicate to everyone. Spell out 
acronyms on each page at least once. This can be the first time you use it or you 
can have a footnote at the bottom of the page defining acronyms or specialized 
language. This isn't about dumbing anything down, but rather not making things 
more complicated than necessary. By the way, "ltv_bin" on the x-axis represents 
the loan-to-value ratio. This is commonly referred to as LTV and represents the 
loan amount relative to the value of the property (typically expressed as a percent, 
but here we have it as a decimal). The higher the LTV, the riskier the loan, because 
the higher proportion the loan amount is relative to the value of the property. UPB 
is unpaid principal balance: the sum of total outstanding loans. 
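
To make these definitions concrete, here is a minimal sketch in Python of how a loan's LTV could be assigned to a bin and how unpaid principal balance could be totaled per bin. The bin width of 0.1, the loan records, and the function names are all assumptions for illustration; the source graph does not state how "ltv_bin" was constructed.

```python
from collections import defaultdict

BINS_PER_UNIT = 10  # assumed bin width of 0.1 on a 0-to-1 LTV scale

# Hypothetical loans: (loan amount, property value, unpaid principal balance),
# all in whole dollars. The numbers are made up.
loans = [
    (160_000, 200_000, 150_000),  # LTV 0.80
    (90_000, 100_000, 88_000),    # LTV 0.90
    (255_000, 300_000, 250_000),  # LTV 0.85
    (50_000, 250_000, 40_000),    # LTV 0.20
]


def ltv_bin(loan_amount, property_value):
    """Lower edge of the LTV bin, e.g. 0.8 for an LTV of 0.85.

    Integer arithmetic avoids floating-point surprises at bin edges;
    both inputs are assumed to be whole dollars.
    """
    return (loan_amount * BINS_PER_UNIT // property_value) / BINS_PER_UNIT


def upb_by_ltv_bin(loans):
    """Total unpaid principal balance per LTV bin, sorted by bin."""
    totals = defaultdict(int)
    for loan_amount, property_value, upb in loans:
        totals[ltv_bin(loan_amount, property_value)] += upb
    return dict(sorted(totals.items()))


upb = upb_by_ltv_bin(loans)
```

The per-bin totals are what the UPB series in the graph would represent under these assumptions: how much outstanding balance sits at each level of risk.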

There is also some pretty convoluted language in the title of the graph and at the 
bottom of it. I guarantee you that the person who created this graph knows exactly 
what it all means. I can decipher enough to believe that it indicates the product in 
the title and what they used for their out of time sample to validate the model. Who 
our audience is will dictate whether and how prominently we need to present this. 
If we're reporting to a senior leadership team, we probably don't need to get into 
any of those details—they are going to trust that we know our stuff and have done 
it in a way that makes sense. If we're communicating to people who will care about 
the technical details, we may need to include some of this, but it's likely footnote 
material rather than things that are prominently called out as in the original. 
Change up line style sparingly. Dotted lines are super attention grabbing. They
also add some visual noise. If we think about a dotted line from a clutter standpoint,
we've taken what could have been a single visual element (a line) and
chopped it into many pieces. Because of this, I recommend reserving the use of 
dotted lines for when you have uncertainty to depict: a forecast, prediction, target 
or goal. In these cases, the visual sense of uncertainty we get with the dotted or 
dashed line more than makes up for the additional noise it adds. The blue model
line in Figure 7.3a is the perfect use of a dotted line. When it comes to the
green UPB line, though, I certainly hope we aren't estimating the volume of unpaid
principal balance across our portfolio: we should know exactly what that is!
Use thick, solid, filled-in elements to depict actual data and thin, dotted, unfilled
points to represent estimated data.

Eliminate the secondary y-axis. I recommend avoiding the use of a secondary 
y-axis, both in this specific example and in general. The challenge with a secondary
axis is that, no matter how clearly things are titled and labeled, there is always
some work that has to be undertaken to figure out which data to read against 
which axis. I don't want my audience to have to do this work. Rather than use a 
secondary axis, you can hide the secondary y-axis and instead title and label the 
data that is meant to be read against it directly. As another alternative, you can 
create two graphs, using the same x-axis across each. Putting the data into separate
graphs means you can title and label each of the data series on the left, so
there's no question of "Do I look left or right to get the details that I need?" 

The data that is meant to be read against the secondary axis is the unpaid principal
balance. This is shown in an odd way in Figure 7.3a: thousands of thousands.
A thousand thousand is a million. Changing our scale to millions will make it easier
both to process the graph and to talk about the data it depicts.
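
The rescaling is a single division, but it is easy to get wrong when labels are generated automatically. A small sketch in Python, assuming the source values are stored in thousands of dollars; the function name is hypothetical.

```python
def thousands_to_millions_label(value_in_thousands):
    """Format a value stored in thousands of dollars as a label in millions."""
    return f"${value_in_thousands / 1_000:,.0f}M"


# 800,000 thousands of dollars is 800 million dollars.
label = thousands_to_millions_label(800_000)
```

Applying the conversion once, where the labels are produced, keeps the axis and any spoken or written description of the data in the same units.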

It seems like the general shape of the data is more important than the specific numeric
values. Given this, I'd recommend employing the second alternative raised
previously: divide the data across two graphs, as shown in Figure 7.3b. 
[Graph: Model performance by LTV; y-axis ticks from 5% to 25%]

FIGURE 7.3b Pull the data apart into two graphs

In Figure 7.3b, I separated the visual into two graphs. The top graph shows Model 
and Actual prepayment rate. The LTV axis in the middle is meant to be read across 
both graphs (if you find this confusing, you could simply repeat it below the
second graph). The bottom graph shows the distribution of loans by Unpaid Principal
Balance in our portfolio. We'll look at another potential option for presenting
this data momentarily. First, let's continue my points of feedback. 

The big red box doesn't highlight the right thing. Someone looked at this graph 
and thought, "I would like you to look here," and then they drew a red box around it.
I appreciate the effort. However, I think this might be a red herring.

Now that we've hopefully answered a lot of the questions about this data, take a 
look back at the red box in Figure 7.3a. What are we meant to see? Can you state 
it in a sentence? 

If I were to do so, my sentence would be: our model overpredicts prepayment at 
low LTVs. Is that an issue? Look back at Figure 7.3a. Do we have any loans in our 
portfolio at low LTVs? (Hint: look at the green dotted line to answer this question.) 

No. We don't have any loan balance in that part of the portfolio. That's probably 
why our model isn't performing well there: we didn't have any loans to model on 
in this area. Beyond that, low LTVs represent our least risky loans. These are cases 
where the loan amount is low compared to the property value (so if someone 
doesn't pay and the bank needs to take the house, they will make their money 
back...and then some). This probably isn't cause for concern. That said, there is
still something interesting here. We'll get to that momentarily. 
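
The reasoning above, that a model miss only matters where there is balance behind it, can be sketched as a simple check in Python. The per-bin numbers, the thresholds `max_error` and `min_share`, and the function name `flag_bins` are all hypothetical; in practice the thresholds would be a judgment call for the team.

```python
# Hypothetical per-bin summary:
# LTV bin -> (actual prepayment rate, modeled rate, UPB in $ millions).
bins = {
    0.1: (0.10, 0.22, 0),
    0.5: (0.11, 0.12, 300),
    0.7: (0.11, 0.11, 800),
    0.9: (0.12, 0.07, 250),
}


def flag_bins(bins, max_error=0.03, min_share=0.05):
    """Return bins where the model misses by more than `max_error`
    AND the bin holds at least `min_share` of total balance."""
    total_upb = sum(upb for _, _, upb in bins.values())
    flagged = {}
    for ltv, (actual, model, upb) in bins.items():
        error = model - actual  # positive means the model overpredicts
        share = upb / total_upb if total_upb else 0.0
        if abs(error) > max_error and share >= min_share:
            flagged[ltv] = (round(error, 4), round(share, 4))
    return flagged


flagged = flag_bins(bins)
```

With these made-up numbers, the large overprediction in the lowest bin is not flagged because no balance sits there, while the underprediction in the high-LTV bin is, which is the same conclusion the walkthrough reaches by eye.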

STEP 3: Let me walk you through how I would present this data in a live setting, 
which—as we saw in the prior exercise—gives us some interesting options for 
communicating this data. 

I can start by setting up the graph for my audience. Today, we'll be looking at 
modeled versus actual prepayment rates by LTV. Prepayment rate is shown on the 
vertical y-axis. Loan to value, LTV, is depicted across the x-axis. (Figure 7.3c) 

FIGURE 7.3c First, set the stage


Actual prepayment doesn't vary much by LTV: this line is pretty flat. (Figure 7.3d) 

FIGURE 7.3d Actual prepayment doesn't vary by LTV
Our model, however, behaves differently. It overpredicts at low LTVs and underpredicts
at high LTVs. (Figure 7.3e)

FIGURE 7.3e Model overpredicts at low LTVs and underpredicts at high LTVs



Next, I'm going to do something a little different. You might ask: how big of a 
deal is this? Where are loans concentrated in the portfolio? I'm going to replace 
prepayment rate on the y-axis with the unpaid principal balance of loans across 
our portfolio. That looks like this. (Figure 7.3f) 


FIGURE 7.3f Distribution of loans across the portfolio


Here is how the loans are distributed across the portfolio. Let's pause and take 
note of the y-axis scale and how the data lines up against it: the biggest bar represents
roughly $800 million in unpaid principal balance. That said, more important
than the specific numbers is that we focus on the general shape of the data, so
I'm going to get rid of this axis in my next step. At the same time, I'll push these 
bars to the background and layer the modeled and actual prepayment rates back 
onto the graph. (Figure 7.3g) 

Model performance by LTV 
FIGURE 7.3g Add prepayment back to graph

This allows us to see that the model over-predicts prepayment at low LTVs, but
we don't have any portfolio concentrated there. (Figure 7.3h) 


Model performance by LTV 
LOAN TO VALUE (LTV)

FIGURE 7.3h Model over-predicts prepayment at low LTVs 


The model under-predicts prepayment at high LTVs. By the way, we do have loan
balances in that part of the portfolio. We should watch this going forward. (Figure 7.3i)

Model performance by LTV



0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

LOAN TO VALUE (LTV) 

FIGURE 7.3i Model under-predicts prepayment at high LTVs 


The preceding sequence would work well in a live presentation. If we need a single 
visual to send out to remind people what we discussed or for those who missed the 
meeting, I would annotate important points directly on the slide. This will make it clear 
to my audience what they are meant to take away and what it means. See Figure 7.3j. 


Prepayment model has limitations 


The prepayment model performs well in the LTV range where most 
of our portfolio loans are concentrated. 
However, the model overestimates prepays at low LTVs and underpredicts at high
LTVs. This is a model limitation.

ACTION: avoid concentrating portfolio in low or high LTV loans.


LOAN TO VALUE (LTV) 


FIGURE 7.3j Annotate important points directly on slide 

Lessons put into practice: use accessible language and don't overcomplicate. 
Highlight sparingly with color. Articulate your message clearly so your audience 
doesn't miss it! 

Exercise 7.4: back-to-school shopping

As a data analyst at a national clothing retailer, you are gearing up for this year's 
back-to-school shopping season. You've analyzed survey data from last year's 
back-to-school shopping to understand customers' experience—what they liked 
and what they didn't like. You believe the data reveals some clear opportunities 
and want to use it to inform the strategy for this year's back-to-school shopping 
season across your company's stores. 

This example may sound familiar; we've seen it before in Exercises 1.2, 1.3, 1.4, 
1.7, 6.3 and 6.4. Refer back to these exercises and corresponding solutions to 
remind yourself how we thought about our audience, Big Idea, storyboard, tension,
resolution, and the narrative arc for this scenario. Review Figure 7.4a and
complete the following. 


Back-to-school shopping survey results

                                          % FAVORABLE
STORE OFFERS...                           Our store   All stores
The store is well-organized.              40%         38%
Fast and easy checkout.                   33%         34%
Friendly and helpful employees.           45%         50%
Good promotions.                          45%         65%
I can find what I’m looking for.          46%         55%
I can find the size I need.               39%         49%
A nice atmosphere.                        80%         70%
Latest technology for easy shopping.      35%         34%
Lowest sales prices.                      40%         60%
A wide selection.                         49%         47%
Items I can't find elsewhere.             74%         54%
The latest styles.                        65%         55%

FIGURE 7.4a Back-to-school shopping survey results


STEP 1: What is the story here? How would you visualize the data in Figure 7.4a to
lend insight into what we should focus on in this situation? Reflect on all of the
lessons that we've covered and how you would apply them. Make assumptions about
the scenario as needed. Download the data and create your preferred visual(s).


STEP 2: You will be walking your audience through this data in a meeting. How 
would you present the information to them? Develop your materials in the tool 
of your choice. 


STEP 3: You anticipate that your audience will want you to send them content after 
the meeting. This will remind those who attended what was talked about as well 
as let folks who weren't able to attend know what was discussed. How would you 
design graphs or slides to meet this need? Create in the tool of your choice. 

Solution 7.4: back-to-school shopping

STEP 1: Let's look at a few iterations of this data to see which can help us
facilitate that magical "ah ha" moment of understanding what graphs done well can
do. First, I'll try a scatterplot. See Figure 7.4b.


Survey Results 



FIGURE 7.4b Scatterplot 

The scatterplot seems to prompt more questions than it answers; it doesn't work 
well for this data. Let's try a line graph. See Figure 7.4c. 


Survey Results 



FIGURE 7.4c Line graph 

Do lines work for us here? We can more easily pick out highs and lows than the prior
view. But the lines are connecting categorical data in a way that doesn't make sense. 
Given that we have categorical data, let's try a bar chart. See Figure 7.4d. 



FIGURE 7.4d Vertical bars 


The most frequent reason I find myself moving from a standard vertical bar chart 
to a horizontal bar chart is simply to get more space to write the x-axis labels.
Diagonal elements are attention-grabbing. They are also messy: they create jagged
edges, which look disorganized. Worse than any of that, though, diagonal text is
slower to read than horizontal text. This is an easy fix: we can rotate our graph 90 
degrees to a horizontal bar chart, which gives us more space to write the category 
names in a legible fashion. See Figure 7.4e. 

Survey Results

■ All stores   ■ Our store

FIGURE 7.4e Horizontal bars

Any time we show data, we want to be thoughtful about how we order the data. 
Sometimes, there is a natural ordering inherent to our categories that we should 
honor. If we don't have a set order to our categories, then we want to order
meaningfully by the data. In doing so, we should think back to that zigzagging "z" of
processing: without other visual cues, your audience will start at the top left of
your page or screen and do zigzagging "z's" with their eyes to take in the
information. This means they encounter the top left of your graph first. If the small pieces
are the important ones, we might put those at the top. See Figure 7.4f. 


Survey Results

■ All stores   ■ Our store

FIGURE 7.4f Sort ascending




However, if we step back and think about story progression, we'd be starting with 
where we perform the worst, which could be a bit abrupt. Perhaps we want to 
start with where we perform well and then move into the opportunities: we could 
put the large categories at the top and sort in descending fashion. See Figure 7.4g. 
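
As a rough sketch of the two sort orders discussed above, the following Python orders the survey items by the Our store score. The scores are taken from Figure 7.4a (labels lightly shortened); the names `SURVEY` and `sort_by_our_store` are hypothetical, and the text bars are only a stand-in for a real chart made in your tool of choice.

```python
# % favorable from Figure 7.4a, stored as item -> (our store, all stores).
SURVEY = {
    "The store is well-organized": (40, 38),
    "Fast and easy checkout": (33, 34),
    "Friendly and helpful employees": (45, 50),
    "Good promotions": (45, 65),
    "I can find what I'm looking for": (46, 55),
    "I can find the size I need": (39, 49),
    "A nice atmosphere": (80, 70),
    "Latest technology for easy shopping": (35, 34),
    "Lowest sales prices": (40, 60),
    "A wide selection": (49, 47),
    "Items I can't find elsewhere": (74, 54),
    "The latest styles": (65, 55),
}


def sort_by_our_store(survey, descending=True):
    """Return item names ordered by the 'our store' score (hypothetical helper)."""
    return sorted(survey, key=lambda item: survey[item][0], reverse=descending)


if __name__ == "__main__":
    # Descending order puts the strongest items at the top left,
    # which is where the audience starts reading.
    for item in sort_by_our_store(SURVEY, descending=True):
        ours, _ = SURVEY[item]
        # One '#' per two percentage points keeps the text bars short.
        print(f"{item:<36} {'#' * (ours // 2)} {ours}%")
```

Passing `descending=False` gives the ascending order, which starts with the lowest-scoring items instead.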


Survey Results

■ Our store   ■ All stores

FIGURE 7.4g Sort descending

Excel moved my x-axis to the top with this reorganization in Figure 7.4g, which I 
like. It means my audience hits how to read the data before they get to the data. 

In the spirit of applying the other lessons we've covered: next, let's declutter. Before 
reading on, spend a moment studying Figure 7.4g. What clutter would you eliminate?
What other changes would you make to ease the processing of this data?

Survey Results

■ Our store   ■ All stores

FIGURE 7.4h Decluttered graph


Figure 7.4h represents my decluttered graph. I removed the chart border and gridlines.
I thickened the bars. I increased the x-axis maximum to 100% and reduced
the frequency of x-axis labels so they would fit horizontally. I removed the y-axis 
line and tick marks. I eliminated the periods from the ends of the y-axis labels and 
shortened the second-to-last label so it would fit on a single line. I put the legend 
at the top of the graph, so my audience will encounter it before they get to the 
data and used similarity of color to visually tie it to the data it describes. 


Before proceeding, look back at Figure 7.4h: where are your eyes drawn? 

If you're like me, your response is: nowhere very clearly. This means we aren't 
currently using our preattentive attributes strategically to direct attention. Let's be 
more thoughtful in how we use our color and contrast. See Figure 7.4i. 

Survey Results

■ Our store   ■ All stores

FIGURE 7.4i Focus attention


In Figure 7.4i, I've pushed most elements of the graph to the background by making
them grey. I drew attention to Our Store by making it dark blue. We'll further
focus attention in a few different places when we tell our story in a live progression 
momentarily. First, let's add the words that need to be present to ensure the data 
is accessible: see Figure 7.4j. 


Back-to-school shopping: consumer sentiment

■ Our store   ■ All stores

STORE OFFERS...   % FAVORABLE

FIGURE 7.4j Add words

Stories have words. At minimum, we need descriptive words on the graph to help 
it make sense: a graph title and axis titles. We can take this a step further and use 
our words to tell a story. Let's do that next. 

STEP 2: If I were presenting in a live meeting, I might go through a progression
similar to the following. 

I'll be making one recommendation today: that we invest in employee training to 
improve the in-store customer experience. (Figure 7.4k) 


Let’s invest in employee training to 
improve the in-store customer experience 


FIGURE 7.4k My Big Idea summarized in a pithy, repeatable phrase 

Let me back up and set the plot. The back-to-school shopping season makes up
nearly a third of our annual revenue, so is a huge driver of our overall success. But
we've not historically been data-driven about how we've approached it. We allowed
a one-off compliment or criticism at the store level to drive how we did things.
That worked okay when we were small, but clearly it doesn't scale. So we thought:
let's get more data-driven about how we plan for this important part of our
business. Coming out of last year's back-to-school shopping season, we conducted a
survey of our customers and the customers of our competitors. The data collected
lends important insight into both how we fare across different dimensions of our
store experience, as well as how we stack up against the competition. (Figure 7.4l)


Back-to-school shopping accounts for

30%

of our annual revenue. Because of this, it is a
huge driver of our overall annual success.

Data source: monthly Sales report. Based on prior three years (2017, 2018, 2019) of
annual back-to-school sales compared to total annual sales.


FIGURE 7.4l Back up and set the plot

Today I'll take you through those survey results and use them to frame up a
specific recommendation. I've already foreshadowed this: I believe we should invest
in employee training to improve the in-store customer experience. (Figure 7.4m)


What we’ll cover today

1 Discuss what we’ve learned from our survey analysis¹ and

2 Suggest specific recommendations on changes to make for the upcoming
back-to-school shopping season to improve customer satisfaction and increase sales.

¹Comprehensive details on survey methodology and related info can be found in the Appendix on pages


FIGURE 7.4m What we'll cover today

Before we get to the data, let me set up for you what we're going to be looking 
at. We asked people about a number of dimensions of the shopping experience:
things like the store offers a nice atmosphere, items I can't find elsewhere, and
the latest styles. For each of these dimensions, we'll be summarizing into percent 
favorable. This is the proportion of respondents who indicated positive sentiment 
on the given item. (Figure 7.4n) 

Back-to-school shopping: consumer sentiment

STORE OFFERS...   % FAVORABLE

FIGURE 7.4n Set up the graph

Let's add the data for our stores. You'll see there is variance in performance across
the different items. (Figure 7.4o) 

Back-to-school shopping: consumer sentiment

STORE OFFERS...   % FAVORABLE

FIGURE 7.4o Focus on our business

Let's focus first on where things are going well. We score highest in three areas: a
nice atmosphere, items people can't find elsewhere, and the latest styles. Verbatim
comments echoed these points as well: people like the idea of shopping with
us and they have positive brand association. (Figure 7.4p)

Back-to-school shopping: consumer sentiment

STORE OFFERS...   % FAVORABLE

FIGURE 7.4p Focus on highest scoring items




But there is also another side of the story: items where we score lower. (Figure 7.4q) 

Back-to-school shopping: consumer sentiment

STORE OFFERS...   % FAVORABLE

FIGURE 7.4q Focus on lowest scoring items


Interestingly, when we layer on our competitor data (which I'll do next via grey
bars), we score on par with the competition in these low scoring areas. So that's
not where we recommend focusing. (Figure 7.4r)


Back-to-school shopping: consumer sentiment

■ Our store   ■ All stores

STORE OFFERS...   % FAVORABLE

FIGURE 7.4r Add competitor data

There are other items, however, where we score lower than the competition. (Figure 7.4s)


Back-to-school shopping: consumer sentiment

■ Our store   ■ All stores

STORE OFFERS...   % FAVORABLE

FIGURE 7.4s Highlight where we underscore competition


Next, I'm going to transition to a different view of the data. Rather than plot 
absolute percent favorable, I'm going to graph the difference between the bars. 
The left-hand side represents where we underperform (score lower than) the
competition. The right-hand side shows where we outperform (score higher
than) the competition. (Figure 7.4t)
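
To make the shift from absolute scores to differences concrete, here is a minimal Python sketch that subtracts the All stores score from the Our store score for each item in Figure 7.4a and lists the results from largest to smallest. The names `SCORES` and `differences` are hypothetical, and the item labels are lightly shortened from the figure.

```python
# % favorable from Figure 7.4a, stored as item -> (our store, all stores).
SCORES = {
    "The store is well-organized": (40, 38),
    "Fast and easy checkout": (33, 34),
    "Friendly and helpful employees": (45, 50),
    "Good promotions": (45, 65),
    "I can find what I'm looking for": (46, 55),
    "I can find the size I need": (39, 49),
    "A nice atmosphere": (80, 70),
    "Latest technology": (35, 34),
    "Lowest sales prices": (40, 60),
    "A wide selection": (49, 47),
    "Items I can't find elsewhere": (74, 54),
    "The latest styles": (65, 55),
}


def differences(scores):
    """Our store minus all stores, in percentage points, largest first."""
    diffs = {item: ours - everyone for item, (ours, everyone) in scores.items()}
    # sorted() is stable, so tied items keep their original relative order.
    return dict(sorted(diffs.items(), key=lambda pair: pair[1], reverse=True))


if __name__ == "__main__":
    for item, diff in differences(SCORES).items():
        side = "OUTPERFORM" if diff > 0 else "UNDERPERFORM"
        print(f"{diff:+4d}  {side:<12} {item}")
```

Positive values land on the outperform side of the graph and negative values on the underperform side; the gaps range from +20 points (items I can't find elsewhere) to -20 points (promotions and sales prices).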


Back-to-school shopping: consumer sentiment 

UNDERPERFORM | OUTPERFORM 
STORE OFFERS... DIFFERENCE: % FAVORABLE VS. COMPETITORS 

-20% -10% 0% 10% 20% 


Items I can't find elsewhere 
A nice atmosphere 
The latest styles 
A wide selection 
The store is well-organized 
Latest technology 
Fast and easy checkout 
Friendly and helpful employees 
I can find what I'm looking for 
I can find the size I need 
Good promotions 
Lowest sales prices 




FIGURE 7.4t Shift from focus on absolute to difference 

Let's refocus again, first on where things are going well. The three areas where we
outperform the competition the most (items that can't be found elsewhere, a nice
atmosphere, and the latest styles) are the same three items we score highest
on an absolute percent favorable basis. (Figure 7.4u)


Back-to-school shopping: consumer sentiment

UNDERPERFORM | OUTPERFORM

STORE OFFERS...   DIFFERENCE: % FAVORABLE VS. COMPETITORS

FIGURE 7.4u Focus on items we outperform


But there are also areas where we underperform the competition. (Figure 7.4v) 


Back-to-school shopping: consumer sentiment

UNDERPERFORM | OUTPERFORM

STORE OFFERS...   DIFFERENCE: % FAVORABLE VS. COMPETITORS   % FAV

FIGURE 7.4v Focus on items we underperform

We underperform the most in items related to promotions and sales. These are
areas we've intentionally avoided historically because of the brand dilution we 
expect may result. We don't recommend focusing here. (Figure 7.4w) 


Back-to-school shopping: consumer sentiment

UNDERPERFORM | OUTPERFORM

STORE OFFERS...   DIFFERENCE: % FAVORABLE VS. COMPETITORS   % FAV

FIGURE 7.4w Underperform the most in promotions


Rather, look at these other areas where we underperform. Friendly and helpful 
employees, I can find what I'm looking for, and I can find the size I need—it is 
alarming that we underscore the competition so much in these areas. The good 
news is that these are all aspects of customer experience over which our sales 
associates have direct control. (Figure 7.4x) 


Back-to-school shopping: consumer sentiment

UNDERPERFORM | OUTPERFORM

STORE OFFERS...   DIFFERENCE: % FAVORABLE VS. COMPETITORS   % FAV

FIGURE 7.4x Recommend focusing on areas we can control

Let's invest in employee training to create a common understanding of what good
service looks like to improve the in-store customer experience and make the
upcoming back-to-school shopping season the best one yet! (Figure 7.4y)


Let’s invest in employee training to 
improve the in-store customer experience 


FIGURE 7.4y Repeat my Big Idea summarized in a pithy, repeatable phrase 

STEP 3: If I need something to send out after my live presentation, I would fully
annotate a slide so that my audience processing it on their own would get a similar 
story to what I took my audience through in the live progression. See Figure 7.4z. 


Action needed: invest in employee training

Back-to-school shopping: consumer sentiment

UNDERPERFORM | OUTPERFORM

STORE OFFERS...   DIFFERENCE: % FAVORABLE VS. COMPETITORS

THE GOOD NEWS: We're beating the competition when it comes to the latest
styles that people can’t find elsewhere and store atmosphere.

WE CAN IMPROVE: We score low and lower than the competition in areas related
to helpful employees and customers being able to find what they are looking for.
We also score lower than the competition on promotions/sales, but don’t
recommend focusing here.

RECOMMENDATION: Invest in employee training to improve customer experience.


FIGURE 7.4z Final annotated version

This is a good illustration of the power of applying the many lessons that we've 
covered: building a robust understanding of the context, choosing an appropriate
visual display, identifying and eliminating clutter, drawing attention where we
want it, thinking like a designer, and telling a story. Don't just show data: make 
data a pivotal point in an overarching story! 

Exercise 7.5: diabetes rates

The following case study was created and solved by storytelling with data team 
member Elizabeth Hardman Ricks. 

Imagine you work as an analyst for a large health care system with medical centers
in several states. Your role is to use data to understand trends in the patient
base and communicate your findings to help administrators make organizational 
decisions. Your analysis has shown a recent rise in diabetes rates across all medical 
centers (A-M) in a given region. If this trend continues at its current rate, the centers 
may be understaffed to provide an appropriate level of care. Specifically, you've 
estimated the increase will be an additional 14,000 patients per year for the next 
four years. You want administrators to understand the trend in diabetes rates and 
use that information to determine whether additional resources are needed. 
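
The size of that estimate is easier to appreciate as a running total. The short Python sketch below assumes each year's additional 14,000 patients stay in the patient base, so the totals accumulate; that reading, and the names `ADDITIONAL_PER_YEAR`, `YEARS`, and `cumulative_additional_patients`, are illustrative assumptions rather than part of the scenario.

```python
ADDITIONAL_PER_YEAR = 14_000  # estimate stated in the scenario
YEARS = 4                     # projection horizon stated in the scenario


def cumulative_additional_patients(per_year, years):
    """Running total of additional patients at the end of each year.

    Assumes a constant yearly increase and that earlier increases persist.
    """
    return [per_year * year for year in range(1, years + 1)]


if __name__ == "__main__":
    totals = cumulative_additional_patients(ADDITIONAL_PER_YEAR, YEARS)
    for year, total in enumerate(totals, start=1):
        print(f"Year {year}: {total:,} additional patients versus today")
```

Under these assumptions, the centers would be caring for 56,000 more diabetic patients at the end of year four than they are today.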

You're planning to share this analysis at an upcoming meeting. You've visualized 
the diabetes rates four different ways, as shown in Figure 7.5a. Spend a moment 
familiarizing yourself with the data then complete the following steps. 


Diabetes rates by medical center: 4 views of the same data 

OPTION A: Bars OPTION B: Separate lines 



OPTION C: Standard line graph 


OPTION D: Slopegraph 



FIGURE 7.5a Diabetes rates by medical center 

STEP 1: Let's start by considering our audience. The decision maker is a senior
administrator. Because this is an anonymized example, we won't aim to pinpoint
the needs of a specific person; rather, we can think generally about what
motivating factors someone in this role may have. What would keep them up at night?
What might motivate them? Spend a few minutes brainstorming and make a list. 

STEP 2: Create the Big Idea for your communication (if helpful, refer to the Big 
Idea worksheet in Exercise 1.20). Feel free to liberally make assumptions as
needed for the purpose of the exercise.

STEP 3: Next, let's think about the narrative arc. What tension exists for the
audience? What does your analysis suggest that resolves this tension? What pieces
of content will you need to provide your audience? With this in mind, create a 
storyboard (if helpful, refer to Exercises 1.23 and 1.24) and arrange your pieces of 
content along the narrative arc (see Exercise 6.14). 

STEP 4: Review the four graphs in Figure 7.5a. Analyze each and observe what 
it allows you to most easily see in the data. Write a one-sentence observation for 
each graph. Think back to the Big Idea you crafted in Step 2: which of these
approaches reinforces your message best?

STEP 5: Assume you have a tight timeline to communicate your findings. A key 
stakeholder has asked for an update by the end of the day today. Refer to the 
graph you identified as working best in Step 4. Assume you don't have time to 
change anything about the layout of the graph. How could you use color and 
words to make the main takeaway clear? Download the data and make these 
changes to your selected graph. 

STEP 6: Your visual from Step 5 was well received (nice work!). Administrators 
would like to discuss the data at an upcoming meeting where your manager will 
present the full analysis, including your forward-looking projections that diabetes 
rates will continue to increase. Create the deck that your manager will use in your 
tool of choice to tell a story with this data. Provide the accompanying narrative as 
speaker notes for each slide. 

Solution 7.5: diabetes rates

STEP 1: In brainstorming what my audience might care about, I set a timer and 
wrote down as many ideas as I could in five minutes. When I stepped back and 
looked at my list, I realized I could group what I'd come up with into five categories: 

1. Financial: controlling operating expenses, hitting revenue targets 

2. People: recruiting providers, managing and retaining talent to deliver quality 
patient care 

3. Accreditation and standards: remaining within certain benchmarks, navigating
government regulations

4. Suppliers: maintaining reimbursement levels from insurance companies,
negotiating contracts, purchasing medical equipment

5. Competitors: maintaining a superior level of patient care and/or cost compared
to other facilities and patient options

STEP 2: As I worked through the Big Idea worksheet, the motivating factors from 
my list in Step 1 helped me narrow in on what's at stake for my audience in this 
specific circumstance. My audience stands to lose revenue (reimbursement from 
payers) and fall below accreditation standards if the patient care does not meet a 
certain threshold. To mitigate this risk, I will ask them to think about hiring
additional resources to meet the growing demand for diabetes care.

My Big Idea for my communication is: 


We should consider hiring additional staff to care for the
projected increase in diabetic patients so that we don't lose
revenue and remain within national accreditation standards.

STEP 3: While my audience has tension coming from several places (from my 
list in Step 1, it's a wonder they sleep at night!), I'd consider the financial
implications to be a strong source of tension. Without revenue coming in, eventually
the system would shut down. This analysis shows one way to remain afloat: staff 
accordingly to provide the appropriate level of care. 

My initial storyboard is shown in Figure 7.5b. Notice I arranged these stickies 
chronologically. This feels most natural to me because it mirrors the steps I
followed in my analytical process.

answer related questions

Explain forecast methodology, assumptions, etc.

IF trends continue, we will be understaffed

Recommend: Hire additional staff to provide appropriate care and remain accredited.


FIGURE 7.5b My initial storyboard


However, I'll want to consider my audience's perspective. I can use the narrative 
arc to arrange these stickies to align to how the data resolves their tension, as 
shown in Figure 7.5c. 

FIGURE 7.5c My storyboard arranged along the narrative arc



STEP 4: When I look at the four graphs in Figure 7.5a, it's interesting how different
views enable us to see certain things about the data more clearly. Here are my
one-sentence observations about each graph: 

1. OPTION A: Center A has the lowest rate while B has the highest rate. 

2. OPTION B: Every line is sloping upward with varying degrees of change. 

3. OPTION C: Every line is sloping upward with Center A lowest (about 3%) 
and a marked increase in Center E between 2017-2019 (roughly 8% to 11%).

4. OPTION D: Center E increased the most (from roughly 8% in 2015 
to 11% in 2019); Center A remains lowest (slightly above 4% in 2019). 

Which graph will help my audience understand my Big Idea the best? I selected 
Option C, the standard line graph, for three primary reasons (although I'll definitely
need to make some design changes, namely color and clutter, before presenting
it). First, this view provides sufficient historical context, which I'll need my
audience to see to ground them in what has happened and how this affects future 
expectations. Second, the line graph makes sense for this data over time and will 
feel familiar to my audience, so there won't be any obstacles to understanding the 
graph. Finally, I want to highlight the line indicating the diabetes rate across all 
of our medical centers in my final communication and show the projection going 
forward. This visual, with some modifications, will allow me to easily do that. 

STEP 5: Time constraints are real. Fire drill requests happen, so I'll need to prioritize 
what changes I can make for the biggest impact. Due to time constraints, I'll skip 
making any modifications to the layout of the graph, but instead will make changes 
when it comes to color and use of words. Figure 7.5d shows what this could look like. 


Diabetes rates have increased 

Do we need additional staff to remain within standards? 



CENTERS OF CONCERN: Center B has 
the highest rate (12.5%); Center E has 
increased the most in the past four years. 
Next steps: further investigate specifics 
through demographic analysis at the 
medical center level and create action plans. 

OVERALL: The diabetes rate across our 
medical centers has increased steadily to 
8.6% in 2019. If this pace continues, we 
estimate an additional 14,000 diabetic 
patients per year for the next four years. 

THE GOOD NEWS: Diabetes rate among 
our patients remains lowest in Center A. 

Next steps: further analysis to understand 
relative level of care and what we might 
make use of in other areas. 


FIGURE 7.5d My visual completed for end-of-day fire drill request 

I chose orange to emphasize the negatives: where the diabetes rate is highest and in
which center it increased the most. I utilized black to tie my title ("Diabetes rates
have increased") to the data it describes (All). The subtitle acts to both illuminate
the tension and suggest how the audience can resolve it. I picked blue to accentuate
the positive: it's not all doom and gloom! I added text at the right (tied
through both proximity and similarity of color to the data it describes) with some
additional context and to help my audience understand why I've drawn attention
as I have.

In a time-constrained environment, the contextual considerations we took in Steps
1 and 2 become even more valuable. Because I'd already done these thought exercises,
I was able to create the visual in Figure 7.5d in under 15 minutes.

STEP 6: Figures 7.5e through 7.5p show the materials I would build and speaker notes
for my manager to present this data story.

Today, I'd like you to contemplate an alarming number: 14,000. This is the number
of additional diabetic patients per year we'll have if the current increasing trend in
diabetes rates across our medical centers continues. I'll walk you through the details
of how we arrived at that number momentarily, but keep in mind that our primary
goal today is to discuss whether, given this anticipated increase in patient
needs, we should consider hiring additional staff to remain within accreditation
standards of appropriate care. (Figure 7.5e)


A question to ponder... 


Can our current staffing levels handle an 

additional 14,000 diabetic patients 

per year for the next four years? 


FIGURE 7.5e A question to ponder 

Let me first talk you through the historical trends. We'll be looking at diabetes
rates, expressed as a percent of our total patient base, at the medical center
level from 2015 through 2019. (Figure 7.5f)


Let's set the stage 


Diabetes rates by medical center 

FIGURE 7.5f Start by setting the stage


Let's look across all of our medical centers: overall diabetes rate among our patients
in 2015 was 7.2%. (Figure 7.5g)


Across all medical centers: 7.2% in 2015 


Diabetes rates by medical center 

[Line graph: y-axis from 0% to 14%; x-axis from 2015 to 2019]


FIGURE 7.5g Overall diabetes rate was 7.2% in 2015 

At that point, there were eight centers with higher diabetes rates. (Figure 7.5h)


8 centers: above overall rate 


Diabetes rates by medical center 


[Line graph: y-axis from 2% to 14%; x-axis from 2015 to 2019]


FIGURE 7.5h There were 8 centers with rates higher than overall 

...and five centers with lower diabetes rates. (Figure 7.5i)


5 centers: below overall rate 

FIGURE 7.5i There were 5 centers with relatively low diabetes rates

We've seen a steady increase in the overall diabetes rate in our patients over the
past five years. Today, it is 8.6%. (Figure 7.5j)


Across all medical centers: now 8.6% 


Diabetes rates by medical center 

[Line graph: y-axis from 0% to 14%; x-axis from 2015 to 2019]


FIGURE 7.5j Diabetes rate across our medical centers is 8.6% in 2019 


Over this period, all eight of the centers with higher rates have increased. (Figure 7.5k)


Increase: high centers 

FIGURE 7.5k Medical centers having relatively high diabetes rates increased

Those with relatively lower diabetes rates also all increased. (Figure 7.5l)


Increase: low centers 

FIGURE 7.5l Those centers having lower diabetes rates also increased


The overall rate has increased roughly 0.5 percentage points per year. (Figure 7.5m) 


Consistent increase of 0.5 points per year 

FIGURE 7.5m This is a consistent rise of 0.5 points per year

We forecast diabetes rates forward at the medical center level. I'm happy to talk
about our methodology more specifically if there's interest in that. But the overall
takeaway is that if a similar pace of increase continues, we project the diabetes rate
across our medical centers will be 10% by the year 2023. In other words, one out of
every ten patients across our clinics will be diabetic. (Figure 7.5n)


Projected to be 10% by 2023 

FIGURE 7.5n We project a continued increase


That translates to an additional 14,000 diabetic patients per year for the next four 
years. Given these projections, what should we do to prepare for this? Our initial 
recommendation is to consider hiring additional staff to be able to handle these 
numbers without any dip in patient care. What other options should we be thinking 
about? Let's discuss. (Figure 7.5o) 


Implications: +14,000 patients per year 


Diabetes rates by medical center 



FIGURE 7.5o This implies 14,000 more patients with diabetes per year 

If I needed something that would be sent around, I could have a single fully annotated
slide that could stand on its own. Figure 7.5p shows what that might look like.


Rising diabetes rates: do we need additional staff? 

Center B has the highest rate
(12.5%) and Center E has seen a 
marked increase between 2017 
(8.5%) and today (11.3%). What 
factors are influencing these levels? 

The diabetes rate across all centers 
has increased from 7.2% in 2015 to 
8.6% in 2019. At the current pace, this 
will increase to 10% by 2023. This 
implies an additional 14,000 patients 
per year for the next four years. 

The good news is that we have an 
opportunity to learn what factors are
influencing Center A, which has the 
lowest rate and top patient care. 

Next steps: Let's determine if these 
factors can be applied broadly. 


FIGURE 7.5p Annotated slide to distribute 


In this scenario, we've pulled from practice exercises in Chapters 1, 2, 4, and 6 to
craft a compelling story that should resonate with our audience and help us direct
a discussion focused on action!
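
The projection quoted in the speaker notes can be roughly reproduced with a straight-line extrapolation. The sketch below is only an illustration: it assumes the overall rate keeps changing at the average pace between the two endpoint values given in the text (7.2% in 2015 and 8.6% in 2019). The function names are made up for this example, and the actual forecast was made at the medical center level with a methodology the text does not describe. Note that these two endpoints alone imply an average rise of about 0.35 points per year, somewhat below the "roughly 0.5" stated on the slide, while still landing on the 10% figure for 2023.

```python
# Values quoted in the solution text: overall diabetes rate (percent of patients).
RATE_2015 = 7.2
RATE_2019 = 8.6


def average_annual_change(start_rate, end_rate, years):
    """Average change in percentage points per year between two observations."""
    return (end_rate - start_rate) / years


def project_rate(last_rate, annual_change, years_ahead):
    """Straight-line projection: assumes the same pace of increase continues."""
    return last_rate + annual_change * years_ahead


annual_change = average_annual_change(RATE_2015, RATE_2019, 2019 - 2015)
projected_2023 = project_rate(RATE_2019, annual_change, 2023 - 2019)

print(f"Average change: {annual_change:.2f} points per year")
print(f"Projected 2023 rate: {projected_2023:.1f}%")
```

The patient count (14,000 per year) is not derived here because the size of the patient base is not given in the text.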

Exercise 7.6: net promoter score

Imagine you work as an analyst on the customer insights team in your organization,
which has three primary products. There is a monthly update meeting where
the product team reviews data related to one of the products (cycling through so
each product is focused on once per quarter). Your team has a dedicated 15-minute
spot on the agenda to present voice of customer data related to the product of
focus for the given month. This is done through the Customer Feedback Analysis
slide deck, which always follows the same format: a slide each for title page, data
and methodology, analysis, and findings.

As a bit of background on the customer insights-related data you track, customers
rate your products on a 5-star scale. You categorize 1-3 stars as "detractors"
(those not likely to recommend the product); 4 stars are "passives"; 5 stars are
"promoters" (those likely to recommend the product to others). The primary metric
of focus is Net Promoter Score (NPS), which is the percent of promoters minus
the percent of detractors, expressed as a number (not a percent). You typically
look at NPS over time and compared to your competitor set for a given product.
Customers rating your products also have the option of leaving comments, which
your team categorizes into themes.
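
The NPS definition above can be sketched in a few lines of code. This is a minimal illustration of the scenario's own rule (1-3 stars are detractors, 4 stars are passives, 5 stars are promoters), not a tool used in the book. The function names and the sample ratings are hypothetical and invented for this example.

```python
from collections import Counter


def categorize(stars):
    """Map a 1-5 star rating to its NPS category as defined in the scenario."""
    if stars <= 3:
        return "detractor"
    if stars == 4:
        return "passive"
    return "promoter"


def net_promoter_score(ratings):
    """Percent of promoters minus percent of detractors, as a plain number."""
    counts = Counter(categorize(s) for s in ratings)
    total = len(ratings)
    pct_promoters = 100 * counts["promoter"] / total
    pct_detractors = 100 * counts["detractor"] / total
    return pct_promoters - pct_detractors


# Hypothetical ratings, invented for illustration only:
# six promoters, two passives, two detractors.
sample_ratings = [5, 5, 5, 5, 5, 5, 4, 4, 2, 1]
print(net_promoter_score(sample_ratings))
```

Because passives do not enter the formula, two very different mixes of customers can produce the same NPS, which is why the scenario goes on to look at the components separately.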

The product you focus on, an app, is on the agenda this month. You've updated
the data and have found something interesting: while NPS has generally
increased over time, underlying feedback has become increasingly polarized, with
both promoter and detractor populations increasing as a proportion of total over
time. Analysis of customer comments indicates a theme of latency and speed
concerns among detractors. You'd like to bring this to light and use it to frame a
recommendation to prioritize latency improvements for the product. This seems
like the perfect situation in which to employ the various lessons we've reviewed
and practiced over the course of this book!

The graphs presented on the Analysis slide of your typical deck are shown in Figure 
7.6a. Study it in light of the scenario described, then complete the following steps. 

Analysis


NPS Over Time 

PROMOTER COMMENTS

Not very actionable, promoter 
comments are often short notes 
like "awesome product," "it’s 
great." and "I love it!" 


DETRACTOR COMMENTS 
Detractor comments are mainly 
on the topic of latency and 
speed. Should we take this 
feedback into account in our 
feature release prioritization? 


FIGURE 7.6a Typical graphs presented in monthly meeting 

STEP 1: Form your Big Idea for this situation. Remember the Big Idea should (1)
articulate your point of view, (2) convey what's at stake, and (3) be a complete sentence.
Write it down. If possible, discuss it with someone else and refine. Create
a pithy, repeatable phrase based on your Big Idea.

STEP 2: Let's take a closer look at the data. Write a sentence or two about each
graph that describes the primary takeaway.

STEP 3: Time to get sticky! Get some sticky notes. In light of the context described,
the Big Idea you created in Step 1, and the takeaways you outlined in
Step 2, brainstorm the pieces of content you may include in your slide deck. After
you've spent a few minutes doing this, arrange the pieces along the narrative arc.
What is the tension? What can your audience do to resolve it?

STEP 4: It's time to design your graphs. Download the original graphs and underlying
data. You can either modify the existing visuals or create new ones. Put into
practice the lessons we've covered on choosing appropriate visuals, decluttering,
and focusing attention. Be thoughtful in your overall design.

STEP 5: Create the deck you will use to present using the tool of your choice. Also 
outline the accompanying narrative of what you'll say for each slide. Even better: 
present this deck, walking a friend or colleague through your data-driven story. 

Solution 7.6: net promoter score

STEP 1: My Big Idea could be something like, "We will continue losing users unless
we improve the latency of our product: let's prioritize this in the next feature release."

For my pithy, repeatable phrase, I'll want something simple that doesn't feel overly
salesy given the audience and typical meeting approach. Plus, I anticipate they
will have additional context to lend as we together determine whether my recommendation
is the best course of action. I can use something like, "Let's learn from
our detractors." I could title my deck with this and weave it into my call to action.

STEP 2: Looking back at Figure 7.6a, my takeaways could be as follows. 

• Top left: NPS has increased steadily recently, and as of February 2020, is 
at a 14-month high of 37 (NPS was 29 at this point in time last year, the 
lowest it's been over the time period observed). 

• Top right: We currently rank 4th in NPS across our competitive set. Our 
15 competitors have NPS ranging from a high of 47 (Competitor A) to a 
low of 18 (Competitor O). 

• Bottom left: There has been a shift in makeup across promoters, passives,
and detractors over time. Our users are becoming increasingly
polarized, with the proportion of passives shrinking as the proportion of
promoters and the proportion of detractors increase.

• Bottom right: A high proportion of detractors leave comments, and their 
primary concern is latency. 

STEP 3: Figure 7.6b shows a basic narrative arc for this scenario.

FIGURE 7.6b Narrative arc

STEPS 4 & 5: The following progression shows how I could weave everything 
together into a data-driven story with thoughtfully designed visuals, employing 
the various lessons covered in SWD and this book. 


Today, I want to tell you a story. It's the story of what we've learned from our analysis
of recent customer feedback. Let me offer a sneak peek: as indicated by my
title, detractors play an important role, and we can learn from them in ways that
may influence the go-forward strategy for our product roadmap. (Figure 7.6c)


Let’s learn from our detractors 

Monthly NPS Update 

Presented by: Customer Insights Team 
Date: March 1, 2020


FIGURE 7.6c Title slide 

I have two primary goals today. First, to bring you up to speed on what we've
learned from our analysis of recent customer feedback and related data. It turns 
out looking at NPS alone doesn't tell the whole story. Detractors are increasing. 
Second, I'd like to use the feedback from detractors to frame a conversation on 
how we can address their concerns. This will likely play into the product strategy 
and possibly impact the upcoming feature release schedule. (Figure 7.6d) 


Goal today 


1 Build a common understanding of recent feedback.
Though NPS has been flat to increasing over time, analysis of
components reveals increasingly polarized customer base, with a
marked recent increase in detractors.


2 Revisit product strategy given detractor feedback. 

Detractor comments are on one theme above all others—latency. This 
should influence how we prioritize the various planned product 
improvements. Determine whether and what changes to make. 


FIGURE 7.6d Goal today 


Let's take a look at the data. NPS has generally increased over time and has consistently
increased in the past four months to 37 as of last month. (Figure 7.6e)


NPS: flat to increasing over time 


NPS over time 

FIGURE 7.6e NPS: flat to increasing over time

This 37 NPS puts us in 4th place relative to the competition. We anticipate that
learning from our detractors and addressing their concerns will ultimately improve
our positioning among competitors. (Figure 7.6f)


NPS: we rank 4th against the competition


NPS by company - February 2020 

FIGURE 7.6f We rank 4th against the competition


But as I mentioned, NPS alone doesn't tell the full story. Let's take a look at the
components. As a reminder, we categorize customers based on their ratings of
our product. Those rating us 1-3 stars are categorized as "detractors" (those not
likely to recommend the product); 4 stars are "passives"; 5 stars are "promoters"
(those likely to recommend the product to others). NPS is the percent of promoters
minus the percent of detractors. NPS provides a good aggregate measure but
doesn't give us insight into how the breakdown across its components is changing
over time. So next, let's take a look at those components.

Before I add the data, let me talk you through what we're going to be looking at. The
y-axis represents the percent of total that each component (detractors, passives, and
promoters) makes up. We have time on our x-axis, ranging from January 2019
on the left to our most recent point of data, February 2020, on the right. (Figure 7.6g)


Let’s look at NPS components 

FIGURE 7.6g Let's look at NPS components

I'm going to do something a little different here and build this graph from the middle
out. These grey bars represent the proportion of total made up by passives.
You see the proportion of passives is shrinking markedly over time: the height of
these grey bars is getting smaller. (Figure 7.6h)


Proportion of passives decreasing 

FIGURE 7.6h Proportion of passives decreasing

Some of this change is good news: we've seen an accompanying increase in the
proportion of promoters; the dark grey bars at the top are getting bigger over
time. (Figure 7.6i)


Proportion of promoters increasing 

FIGURE 7.6i Proportion of promoters increasing

But as you can probably anticipate based on the empty part of my graph and my 
commentary so far, the detractor population is also increasing as a percent of total. 
(Figure 7.6j) 


Proportion of detractors increasing 

FIGURE 7.6j Proportion of detractors increasing

And actually, let's put a couple of numbers on the graph to help understand the
magnitude of this increase. Detractors made up 10% of those giving feedback at
the beginning of 2019. This increased marginally, to 13% of total, over the first
half of last year. Since then, the detractor population has nearly doubled as a percent
of total. As of February this year, detractors make up 25% of those leaving us
feedback about our product. (Figure 7.6k)


Detractors: nearly doubled since Aug 


NPS component breakdown over time 



FIGURE 7.6k Detractors: nearly doubled since Aug 
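
The figures quoted so far are enough to back out the other two components for February 2020. The sketch below is a worked check, assuming the rounded values in the text (NPS of 37, detractors at 25%, up from 13% in mid-2019) are exact. The function name is hypothetical and chosen for this example.

```python
def components_from_nps(nps, pct_detractors):
    """Recover promoter and passive shares from NPS and the detractor share.

    Uses NPS = %promoters - %detractors, and the fact that the three
    shares sum to 100%.
    """
    pct_promoters = nps + pct_detractors
    pct_passives = 100 - pct_promoters - pct_detractors
    return pct_promoters, pct_passives


# February 2020 figures quoted in the text: NPS of 37, detractors at 25%.
promoters, passives = components_from_nps(37, 25)
print(promoters, passives)

# Growth in detractor share from mid-2019 (13%) to February 2020 (25%).
growth_factor = 25 / 13
print(round(growth_factor, 2))
```

Under these assumptions promoters make up 62% and passives 13% of those rating the product, and the detractor share has grown by a factor of about 1.9, consistent with "nearly doubled."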

In addition to numerical ratings, customers also have the option of leaving comments
that lend further context. Overall, 15% of those rating our product leave a
comment. Relatively fewer promoters leave comments and they tend to be pretty
general and less actionable: things like, "It's great!" and "I really like it!" But we
get some incredibly rich detail from our detractors. Relatively many more leave
comments: 29% of those rating us 1-3 stars share additional detail. (Figure 7.6l)


Detractors: relatively more comments 


Comments by NPS component 



FIGURE 7.6l Detractors: relatively more comments

Our detractor comments focus on one topic more than any other: speed and 
latency concerns. 

Let me read you a sample verbatim comment: "My frustration in a single word: 
latency. It takes forever for the app to open. When it works, it works great. But I 
spend too much time waiting and wondering whether it's ever going to load. It 
often hangs when opening." 

It is disheartening to read comments like this from our users. We've been focused
on adding more features, but it seems something that might help more is making
sure the basics work seamlessly. (Figure 7.6m)


Comments provide insights into issues 


1 in 3 


detractor comments are 
about speed or latency 


This was the single biggest comment theme. The next most common theme, 
unexpected restarts, accounted for only 6% of detractor comments. 


FIGURE 7.6m Comments provide insights into issues 

Now, I fully recognize that there is other context to consider. But I want to make 
sure to bring this customer insight data to light so that we can take it into account 
in our overall product strategy. Improving the latency of our product can help us 
reverse the increase in detractors and simply make for happier users. How should 
this play into our product strategy and upcoming release schedule? Let's discuss. 
(Figure 7.6n) 


RECOMMENDATION: 

Revisit our product and feature release
strategy in light of this feedback and
prioritize latency improvements.

Let’s discuss. 


FIGURE 7.6n Recommendation 

Consider how the path we just took our audience along differs from the typical linear
approach of methodology-analysis-findings that was outlined at the outset of this scenario.
We can use data storytelling to capture and maintain our audience's attention
and frame a productive data-driven discussion. Leaving the room after this meeting,
you'd know the analysis you undertook will help influence decision making.

Will your audience always do what you want them to? Of course not. There are
likely competing priorities or maybe speeding the app up is actually a really complicated
thing. The great thing is, framing things in terms of a recommendation
(thus giving the folks in the room something specific to react to) will drive conversation
that will bring additional relevant context to light. Presenting a data story
does not mean you know all the details or have all the answers. But it does mean
thinking about the data and how we communicate it in a deeper way. When we
are thoughtful about how we do this, we can influence richer debates and smarter
decisions. Success!

We've practiced the holistic process of data storytelling together a handful of times. 
Next up you'll find additional examples and case studies to work through on your own. 

While Chapter 7 posed problems and offered solutions, Chapter 8 has a number
of unsolved exercises: to answer them, you will need to draw on the various lessons
we've covered over the course of SWD and this book. These can be used
as assignments, individual or group projects, or incorporated into tests or exams.
They will also be useful for those simply wanting additional opportunities to apply
the storytelling with data lessons.

The exercises in this chapter can be worked through on your own or with a partner
or small group. They grow in nuance and complexity as you move through
them. For topics or data that don't feel immediately relevant to your work, I still
encourage you to complete the exercises. Continued rehearsal of lessons helps
them become ingrained and enables you to refine your skills in a low-risk setting.
Additionally, practicing in different contexts frees you up from the constraints of
normal day-to-day work, which may bring to light more creative approaches. After
completing an exercise, get feedback and consider what components of your
solution you might employ in your work.

A number of exercises invite you to execute the recommendations you outline 
in the tool of your choice. This additional application helps you better learn your 
tools and further hone your data visualization and data storytelling skills. 

For those assigning exercises from this chapter, feel free to take liberties. There 
is no end to the number of assignments you can create by mixing and matching 
specific discussion points or instructions across the various examples. You might 
use similar exercise framing with your own visuals to create custom exercises. 

Let's practice more on your own! 

But before you dive in, let's review some common myths in data visualization. 

Exercise 8.1: diversity hiring

Your organization recently implemented a diversity hiring initiative for its "ABC
Program." You're interested in understanding the relative success of the initiative.
Familiarize yourself with Figure 8.1, a slide showing related data, then complete
the following steps.


2019 ABC Program Hiring Highlights 


Hiring Overview - 2019 Incoming Interns and Analysts 

131 hires made across all ABC Programs. 3.60 GPA goal was slightly exceeded.




2019 Hires by LOB and Position Type

Program   Intern   Analyst   Intern MBA   Full Time MBA   Subtotal   % of Total Hires
ABXL      40       36        8            3               87         66%
ARC       20       5         2            0               27         21%
EMA       6        5         0            0               11         8%
REP       4        0         0            0               4          3%
QB        2        0         0            0               2          2%
Total     72       46        10           3               131        100%


2019 Hire Average GPA: 3.66


Diversity Hiring Overview - 2019 Incoming Interns and Analysts 

Female hiring target of 25% was exceeded (26%). Achieved ethnic diversity goal of 40%; however, five ethnically diverse 
candidates reneged on their offers. Ratio of diverse to non-diverse hires is 1:1. 


Category Type        # Hires   % All Hires
Ethnic Female        12        9%
Ethnic Male          30        22%
Non-Ethnic Female    23        17%
Non-Ethnic Male      66        49%
Hired TBD            0         0%
Open                 0         0%
Reneged              5         4%
Total                136


2019 Diversity Hires by LOB and Type

Program   EF   EM   NEF   # Div   % Div   Non-Div   % Non-Div
ABXL      7    25   15    47      54%     40        46%
ARC       2    3    6     11      41%     16        59%
EMA       2    1    1     4       36%     7         64%
REP       1    0    1     2       50%     2         50%
QB        0    1    0     1       50%     1         50%
Total     12   30   23    65      50%     66        50%


Diversity Category Type Key: EF = Ethnically Diverse Female, EM = Ethnically Diverse Male, NEF = Non-Ethnically Diverse Female, and Non-Div = Caucasian Male


FIGURE 8.1 Program hiring highlights 
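
If you choose to download the data and rebuild the tables in a tool, a quick consistency check on the transcribed numbers can save time later. The sketch below is one possible way to do that for the "2019 Hires by LOB and Position Type" table. The dictionary layout and function names are assumptions made for this example, not part of the exercise materials.

```python
# Rows transcribed from the "2019 Hires by LOB and Position Type" table:
# program -> (intern, analyst, intern MBA, full-time MBA, stated subtotal)
hires = {
    "ABXL": (40, 36, 8, 3, 87),
    "ARC": (20, 5, 2, 0, 27),
    "EMA": (6, 5, 0, 0, 11),
    "REP": (4, 0, 0, 0, 4),
    "QB": (2, 0, 0, 0, 2),
}


def check_subtotals(rows):
    """Return programs whose position counts do not add up to the stated subtotal."""
    return [
        name
        for name, (*counts, subtotal) in rows.items()
        if sum(counts) != subtotal
    ]


def percent_of_total(rows):
    """Each program's subtotal as a whole-number percent of all hires."""
    total = sum(row[-1] for row in rows.values())
    return {name: round(100 * row[-1] / total) for name, row in rows.items()}


mismatches = check_subtotals(hires)
shares = percent_of_total(hires)
print(mismatches)
print(shares)
```

An empty list of mismatches means every row adds up, and the computed shares can be compared against the "% of Total Hires" column on the slide.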

STEP 1: Let's start with the positives: what do you like about this slide? 


STEP 2: What is not ideal about Figure 8.1? Make notes or discuss with a partner.


STEP 3: What is the primary takeaway? Is this a success story or a call for action?
Articulate in a sentence or two the point(s) you would focus on if you were presenting
this data.


STEP 4: Assume you need to present this data and have been told it has to be in 
tabular form (a table or set of tables). Are there improvements you can make to 
the way the data is shown given this constraint to better focus on the takeaway 
you formed in Step 3? Draw (or if you prefer, download the data and create in 
your tool) the table or tables you would use and detail where and how you would 
direct attention. 

STEP 5: Assume you have more liberty to make changes. How might you present
the data? How would you tell a data-driven story about diversity hiring in the ABC 
Program? Outline your planned approach, then create your ideal materials in the 
tool of your choice. 


Exercise 8.2: sales by region 

Imagine that you are the Sales Manager of the Northwest (NW) region at your 
company. You've pulled the following slide (Figure 8.2) from a monthly report and 
want to cover it at an upcoming offsite with your sales team. You're preparing 
content together with your Chief of Staff. Let's consider two scenarios: 

SCENARIO 1: The offsite is tomorrow, and both you and your Chief of Staff have 
a number of other items to tackle in the meantime. You don't have time to fully 
redesign the visual in Figure 8.2. Assume you can spend at most five minutes 
making changes. What would you do? How would you present the information? 

SCENARIO 2: The offsite is a week away, and your Chief of Staff has volunteered 
to redesign the information shown in Figure 8.2. Before doing so, she's asked for 
your feedback. What aspects do you like about the current visual that you would 
want to be preserved? What changes would you suggest based on the lessons 
we've covered? 

Make notes or discuss with a partner. 


Sales by region: 2018 & 2019 

Data is from Sales Dashboard as of December 31, 2018 and December 31, 2019



[Chart legend: 2018; 2019; Annualized Growth]


- The NW, SW, and NE continue to have the highest sales, making up a combined 53% of total sales. 

- Sales in the NW region decreased 11% between 2018 and 2019. 

- In contrast, sales in the SE, Canada, and Mexico increased over the same time period. 


FIGURE 8.2 Sales by region 

Exercise 8.3: revenue forecast

Study Figure 8.3, which depicts gross and net revenue over time, then complete 
the following steps. 


Net revenue is forecast to grow proportionately with gross revenue, as the gross-to-net 
differential is based on a historical average, not projected going forward. 


Revenue 

2012-2024 



FIGURE 8.3 Revenue forecast 

STEP 1: What do you like about this visual? 


STEP 2: Reflect on the use of the data table. Do you find this effective? If so, 
explain why. If not, how might you approach it differently? 

STEP 3: What other changes would you make? Take notes or talk with a partner. 


STEP 4: Let's envision two distinct circumstances for communicating this data: (1)
presenting in a live meeting and (2) emailing it to your audience. What would you
do differently in these two situations? To take things a step further, download the
data and create your preferred visuals in the tool of your choice, making assumptions
as needed.


STEP 5: Select a data visualization tool that you have not previously used (see the 
Tools section of the Introduction for a partial list). Re-create your visual in this new 
tool. What did you learn from this experience? Write a paragraph or two outlining 
your insights. 

Exercise 8.4: adverse events

Imagine that you work at a medical device company. A colleague approaches you
with the following slide (Figure 8.4) summarizing a couple of points from a recent
study and asks for your feedback. Spend a few moments examining it, then complete
the following steps.





Rate of Adverse Events 13 


Clinical success* in 89.8% of patients 
using optimal settings 13 

Adverse events after procedure 

• 7.5% of Raciplath patients with optimal settings vs. 13.1% 
with non-optimal settings and 17.7% in control group 


Reduced procedural costs 

• Use of optimal settings with Raciplath resulted in fewer 
post-procedure clinical events, translating to a 16% reduction 
in post-procedure management costs ($3,422 savings per 
patient) in the year after the procedure compared to patients 
treated with competitor products 14 


Total Care Management Cost per Patient in Year after Procedure 14

[Bar chart, P < 0.001. Bars: Raciplath with optimal settings (85 pts) and
primary competitor (142 pts). Data labels: $3080, $3715, $21,671, $18,249.
Annotation: 16% lower ($3422).]


FIGURE 8.4 Adverse events 
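
One question worth asking about any slide like this is whether the headline numbers agree with each other. The sketch below checks the cost claim using the two per-patient totals shown on the chart. It assumes the lower total belongs to Raciplath with optimal settings and the higher one to the competitor, which the slide text implies but the extracted labels do not make explicit. The function name is hypothetical.

```python
# Per-patient costs as labelled on the slide. Assumption: the lower figure
# belongs to Raciplath with optimal settings, the higher to the competitor.
COST_COMPETITOR = 21_671
COST_RACIPLATH = 18_249


def savings(baseline, treatment):
    """Absolute and relative reduction of treatment cost versus baseline."""
    absolute = baseline - treatment
    relative_pct = 100 * absolute / baseline
    return absolute, relative_pct


absolute, relative_pct = savings(COST_COMPETITOR, COST_RACIPLATH)
print(f"${absolute:,} saved per patient ({relative_pct:.0f}% lower)")
```

Under that assumption the difference is $3,422, or about 16% of the competitor figure, which matches the annotation on the chart.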

STEP 1: Before jumping into constructive criticism, it can be nice to point out 
what has been done well. What do you like about this slide? 

STEP 2: What questions would you ask your colleague? Make a list. 

STEP 3: What changes might you recommend based on the various lessons we've 
covered? Outline your thoughts, focusing not just on what you would recommend 
changing, but also why. 

STEP 4: How would your recommendations change if you knew this information 
was going to be presented to a non-technical audience? 

STEP 5: To take things a step further, download the data and create your 
revamped slide, incorporating the changes you've outlined in prior steps in the tool 
of your choice. Make assumptions as needed. 

Exercise 8.5: reasons for leaving 

Imagine that you are Chief of Staff for the Chief Marketing Officer (CMO) at a large 
company. Your boss, the CMO, has asked you to work with your Human Resources 
Business Partner (HRBP) to understand what is driving attrition—people leaving 
the company—across the marketing organization and present your findings. Your 
HRBP digs into the data, then emails you the following visual, Figure 8.5. 

Spend a few minutes processing this data, then complete the following steps. 


[Chart: Reasons for Leaving. Each reason is plotted as a marker against two 
axes: HR Business Partner Input (Low to High) and Exit Survey Frequency 
(Low to High). The legend lists the reasons: Training, Conflict with others, 
Lack of recognition, Workload, Career advancement, Pay, Type of work, 
Career change, Commute, Relocation, Illness.] 


FIGURE 8.5 Reasons for leaving 

STEP 1: What is being shown here? Write a few sentences explaining this data: 
how do we interpret this visual? Make assumptions as necessary for the purpose 
of this exercise. 

STEP 2: What is confusing or not ideal about the visual in its current form? What 
questions would you ask or feedback might you give your HRBP? On a related 
note, assume your HRBP spent a lot of time creating this visual—how can you 
frame your feedback so they don't take offense? 

STEP 3: Let's draw! Come up with three different ways to show this data. What 
are some advantages and shortcomings of each? List them. Which view do you 
like best and why? 

STEP 4: Download the data and create your preferred visual in the tool of your choice. 

STEP 5: It's time to present this data to the CMO. Make an assumption about 
whether you'll walk through it live or send it to be consumed on its own. Create your 
recommended communication in light of this assumption in the tool of your choice. 


Exercise 8.6: accounts over time 

You are an analyst in a sales organization and your team has been asked to 
summarize a current campaign, assessing how things are going against the goal of 
increasing the number of accounts. You have actual data through September 2019 
and a forward-looking forecast through the end of 2020. Your colleague has pulled 
together the following summary (Figure 8.6) and asked for your feedback. Spend a 
couple of minutes examining the visual, then complete the following steps. 


[Chart: MARKET MODEL: ACCTS AND FIELD COVERAGE. Bar chart over time with a 
vertical axis running from 0.00 to 450.00 in steps of 50.00 and legend entries 
Accounts and Integration. Data labels 311 and 387 are called out, along with 
the annotation "77% achieved in 9 mos". A data row labeled Accts/AM sits below 
the chart, starting at 2.0 and ending at 25.0.] 

Text box on the slide: Great market & Successful launch 

1. Opened 77% of potential accounts in 9 months 

2. Increased ASP by 54% 

3. Continue to open new accts and maintain business despite 
smaller sales footprint 



FIGURE 8.6 Accounts over time 


STEP 1: What questions do you have about this data? Make a list. 

STEP 2: Let's declutter: make a list of the elements you would remove. 

STEP 3: What is the story—is this a success or call to action? What is the tension in 
this scenario? What action do you want your audience to take to resolve this tension? 

STEP 4: How would you recommend showing this data? Draw or download the 
data and iterate in the tool of your choice to create your preferred design. 

STEP 5: Consider how your approach would vary if you were (1) presenting this 
data live in a meeting and (2) sending it to your audience to be consumed on its 
own. How would the way you'd tackle this differ? Write a few sentences explaining 
your thoughts. To take it a step further, redesign this visual for these different use 
cases in the tool of your choice. 

Exercise 8.7: errors & complaints 

In this scenario, you are an analyst working at a national bank. At the beginning of 
each year, your team compiles a year-end review for each portfolio. This contains 
data from many parts of the lending process—from originations to collections. 
You've been tasked with analyzing data and creating content on the topics of 
quality and satisfaction in the Home Loans portfolio. You started by pulling the 
slide that was used for this in the prior year's review and updated it with the latest 
year's data. The resulting graphs are shown in Figure 8.7. 

Instead of having a page of graphs, you've decided to use this opportunity to tell 
a data story. 

Spend a few minutes studying Figure 8.7, then complete the following steps. 


[Slide: Home Loans Quality & Satisfaction Year-End Review. Four panels: 

Inbound Quality: two graphs, Overall and Product A, each with GOAL: 80%; 
data labels 35% and 33% are visible. 

Historical Complaint Summary: monthly, JAN through DEC, with series 
Solicited, Unsolicited, and Average. 

Top Error Reasons (Overall): largest value labeled 54.4%. 

Top Complaint Categories: monthly, JAN through DEC, with series Employees, 
Experience, Service, Rates, and Procedures. 

Footer: DATA SOURCE: APPLICATIONS & COMPLAINTS TABLES FROM DATA WAREHOUSE | 
JANUARY 1 - DECEMBER 31] 

FIGURE 8.7 Errors & complaints 

STEP 1: What questions would you ask about this data? Make a list. Next, answer 
each of these questions, making assumptions for the purpose of this exercise. 

STEP 2: Write a sentence or two about each graph that describes the primary 
takeaway. 

STEP 3: What story or stories would you focus on here? Which data would you 
include? Is there any data you would omit? Will it all work on a single slide, or 
do you believe it would be best to use more? Sketch your planned approach on 
blank paper. 

STEP 4: How would you visualize the data in Figure 8.7 to lend insight into what 
we should focus on in this situation? Consider all of the lessons that we've covered 
and how you would apply them. Download the data and build your materials 
using the tool of your choice to tell the story with this data. 

STEP 5: Imagine you draft your slide(s) and get feedback from your manager that 
the audience will expect a page of graphs, like they've seen in the past. How will 
you respond to this? Write out your thoughts. 


Exercise 8.8: taste test data 

Craveberry is the new yogurt product that your food-manufacturing employer is 
preparing to launch. The product team on which you work decided to do an 
additional round of taste testing to get a final gauge of consumer sentiment before 
going to market with the product. You've worked with your team to analyze the 
results. You are getting ready to meet with the Head of Product to discuss whether 
to potentially make changes before going to market. (If this sounds familiar, it's 
because we introduced it in the context of the narrative arc in Chapter 6.) 

Your colleague puts together the following visual (Figure 8.8) summarizing the 
taste test results and asks for your feedback. Spend a moment studying it, then 
complete the following steps. 



FIGURE 8.8 Taste test data 

STEP 1: Let's start with the positives: what do you like about this slide? 

STEP 2: What feedback would you give based on the lessons we've covered? 
Outline your thoughts, focusing not just on what you would recommend 
changing, but also why. 

STEP 3: Let's step back and think about story. Reflect on the various components 
of the narrative arc: plot, rising action, climax, falling action, ending. List these 
components and what you would cover within each for this scenario. Even better: 
write out the points of your planned story on sticky notes and arrange them in the 
shape of the narrative arc. Edit as needed to outline the story you would tell with 
this data. What is the tension? What can your audience do to resolve it? 

STEP 4: Download the data and create the data-driven story you outlined in Step 
3 in the tool of your choice. Also outline the accompanying narrative of what you'll 
say when you present to the Head of Product. 


Exercise 8.9: encounters by type 

The following situation may sound familiar; we've seen it once before in Exercise 
6.3. Read through the scenario to refresh your memory, then examine the data 
and complete the following steps. 

You work as a data analyst at a regional health care center. As part of ongoing 
initiatives to improve overall efficiency, cost, and quality of care, there has been a 
push in recent years for greater use of virtual communications by physicians (via 
email, phone, and video) when possible in place of in-person visits. You've been 
asked to pull together data for inclusion in the annual review to assess whether 
the desired shift towards virtual is happening and make recommendations for 
targets for the coming year. Your analysis indicates there has indeed been a 
relative increase in virtual encounters across both primary and specialty care. You've 
forecast the coming year and expect these trends to continue. You can use recent 
data and your forecast to inform targets. You believe seeking physician input is 
also necessary to avoid setting overaggressive targets that could inadvertently 
lead to negative impact on quality of care. 

Figure 8.9 shows the data you'll use to build your story. 


Encounters over Time by Type 
Per 1000 Patients 

                              2015    2016    2017    2018    2019    2020 (Proj) 
In Person   Total            3,659   3,721   3,588   3,525   3,447   3,384 
            Primary Care     1,723   1,735   1,681   1,586   1,526   1,500 
            Specialty Care   1,936   1,986   1,907   1,939   1,921   1,884 
Telephone   Total               28      39     138     263     394     535 
            Primary Care        26      34     125     212     295     375 
            Specialty Care       2       5      13      51      99     160 
Video       Total              0.3     0.5     1.6     2.8     3.4     4.5 
            Primary Care       0.2     0.3     0.4     0.8     1.2     2.0 
            Specialty Care     0.1     0.2     1.2     2.0     2.2     2.5 
Email       Total            1,240   1,287   1,350   1,368   1,443   1,580 
            Primary Care       801     831     852     856     897     950 
            Specialty Care     439     456     498     512     546     630 
TOTAL       Total            4,927   5,048   5,078   5,159   5,287   5,504 
            Primary Care     2,550   2,600   2,658   2,655   2,719   2,827 
            Specialty Care   2,377   2,447   2,419   2,504   2,568   2,677 


FIGURE 8.9 Encounters over time by type 

STEP 1: It's difficult to see what's going on with the data in tabular form, so let's 
start by visualizing it. You can sketch it, or download the data in Figure 8.9 and 
create graphs in the tool of your choice. Do this to build a better understanding of 
the data and what we can learn from it. As part of this, you'll likely want to answer 
the following questions: 

(A) How has the total number of encounters changed over time? 

(B) How do encounters break down across the various types? Is the desired 
shift towards virtual channels (telephone, video, and email) happening? 

(C) Is there a difference between Primary and Specialty Care when it comes 
to use of virtual channels? 

(D) What targets would you recommend for Primary and Specialty Care virtual 
encounters based solely on the data? 
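
If you prefer to explore the numbers in code before graphing, the following Python sketch is one possible starting point for questions (B) and (C). It transcribes the Primary and Specialty Care rows from Figure 8.9 and computes, for each year, the share of encounters that were virtual. The names `ENCOUNTERS`, `VIRTUAL`, and `virtual_share` are hypothetical, and defining "virtual share" as telephone plus video plus email divided by all encounters is an assumption made for illustration.

```python
# Encounters per 1,000 patients, transcribed from Figure 8.9.
YEARS = [2015, 2016, 2017, 2018, 2019, 2020]  # 2020 is a projection

ENCOUNTERS = {
    "Primary Care": {
        "In Person": [1723, 1735, 1681, 1586, 1526, 1500],
        "Telephone": [26, 34, 125, 212, 295, 375],
        "Video": [0.2, 0.3, 0.4, 0.8, 1.2, 2.0],
        "Email": [801, 831, 852, 856, 897, 950],
    },
    "Specialty Care": {
        "In Person": [1936, 1986, 1907, 1939, 1921, 1884],
        "Telephone": [2, 5, 13, 51, 99, 160],
        "Video": [0.1, 0.2, 1.2, 2.0, 2.2, 2.5],
        "Email": [439, 456, 498, 512, 546, 630],
    },
}

# Channels counted as virtual (assumption: everything except in-person visits).
VIRTUAL = ("Telephone", "Video", "Email")


def virtual_share(care_type: str) -> list[float]:
    """Return the virtual share of all encounters for each year."""
    by_channel = ENCOUNTERS[care_type]
    shares = []
    for i in range(len(YEARS)):
        virtual = sum(by_channel[channel][i] for channel in VIRTUAL)
        total = virtual + by_channel["In Person"][i]
        shares.append(virtual / total)
    return shares


for care_type in ENCOUNTERS:
    row = ", ".join(
        f"{year}: {share:.1%}"
        for year, share in zip(YEARS, virtual_share(care_type))
    )
    print(f"{care_type}: {row}")
```

Working with shares rather than raw counts puts Primary and Specialty Care on a comparable footing, which may make any target you propose in (D) easier to explain.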

STEP 2: Consider the provided context along with what you learned in Step 1. 
You anticipate that you'll have to present this data live. Create a low-tech outline 
of your data story. You may do this in written form, putting the various takeaways 
into words and creating a bulleted list. You could also make use of some of the 
tools we've discussed (sticky notes, storyboarding, and the narrative arc). Or 
perhaps you have other ideas. Plan your data story in the way that works best for you. 

STEP 3: Create the data story you outlined in Step 2 using the tool of your choice. 

STEP 4: In addition to the live progression, you'll need a one-pager to be shared 
with those who missed the meeting or as a reminder of what was covered. Create 
this visual in the tool of your choice. 

Exercise 8.10: store traffic 

You are an insights analyst at a large national retailer. You have just completed an 
analysis of recent store traffic and purchase trends. You've visualized the data and 
believe there is a compelling story to tell. 

Store traffic has decreased since last year, both overall and across all regions. 
Traffic decreased the most in the Northeast—this makes sense, as your company 
closed several stores there in the past year. Many of those stores' customers are 
now shopping with your competitors. The decrease in traffic is also more marked 
for your most important customer group, which your organization calls "Super 
Shoppers." This year-over-year gap has increased in recent months. But traffic— 
the number of people shopping—is only one piece of the puzzle. To understand 
how changes in traffic manifest in changes in sales (something management cares 
deeply about), you also have to take into account how much people spend while 
at your store. You measure this by "basket," which is made up of unit purchases 
(the number of items bought) and the price per item. 

The data shows that customers—and Super Shoppers in particular—are generally 
buying fewer items; however, the average price of those items has increased. This 
is likely due to targeted promotions your stores offered with luxury brands in the 
past year. Because of the positive impact of the promotions, you'd like to 
recommend that senior management further investigate the financial implications of 
running additional Super Shopper promotions, both to further test this 
hypothesis and, more importantly, in hopes of turning around the undesirable trends 
you're seeing in the data. 

You were talking through the visuals you created as part of this analysis with your 
manager. In doing so, you realized that the graphs you used to figure out the 
story may not work well for getting that information across to your stakeholders. 
Your manager has asked you to revamp the graphs and pull together a short slide 
deck to communicate your findings and recommendation to senior management. 
You've decided to take a step back and use this as an opportunity to employ the 
various lessons we've covered over the course of SWD and this book. 




Figure 8.10 shows your original graphs. Spend some time studying these, then 
complete the following steps. 


[Three graphs: 

YEAR-OVER-YEAR TOTAL CUSTOMER CHANGE 

YEAR-OVER-YEAR SUPER SHOPPER CHANGE 

TRAFFIC, BASKET & BASKET COMPONENTS YEAR-OVER-YEAR CHANGE: WEEKLY VIEW 

The horizontal axes run weekly from Jun Week 1 through Aug Week 4 and the 
vertical axes show percent change. The legend lists four series: Avg Price 
per Item ($), Avg # Items Bought, Traffic, and Basket.] 


FIGURE 8.10 Your original graphs 


STEP 1: Form your Big Idea for this scenario. Remember the Big Idea should (1) 
articulate your point of view, (2) convey what's at stake, and (3) be a complete 
sentence. Refer to the Big Idea worksheet in Exercise 1.20 if helpful. After crafting 
it, discuss it with someone else and refine. Do you think it makes sense to form 
a pithy, repeatable phrase based on your Big Idea? If so, create one, referring to 
Exercise 6.12 as needed. 


STEP 2: Let's take a closer look at the data. Write a sentence or two about each 
graph that describes the primary takeaway. 

STEP 3: Time to get sticky! Get some sticky notes. In light of the context 
described, the Big Idea you created in Step 1, and the takeaways you outlined in 
Step 2, brainstorm the pieces of content you might include in your slide deck. 
After you've spent a few minutes doing this, arrange the pieces along the narrative 
arc. What is the tension? What can your audience do to resolve it? 

STEP 4: Next, spend some time with the data and design your graphs. Download 
the original graphs and underlying data (you'll find some additional data there, 
too, which may be useful). You'll likely need to iterate through a few different views 
of the data. Consider drawing your ideas as part of your iterating and 
brainstorming process. Put into practice the lessons we've covered on choosing appropriate 
visuals, decluttering, and focusing attention. Be thoughtful in your overall design. 

STEP 5: Create the deck you will use to present using the tool of your choice. Also 
outline the accompanying narrative of what you'll say for each slide. Even better: 
present this deck, walking a friend or colleague through your data-driven story. 

STEP 6: Let's take a few minutes to reflect. Compare the original graphs with what 
you've created. Do you believe your solution will be more effective? Why? How 
did this overall process feel? Which parts were most helpful and why? How can 
you envision applying components of what you've done in this exercise to your 
work in general? Write a paragraph or two outlining your thoughts. 


You've made it: you've learned by example! You've practiced a ton. You've honed 
your data storytelling skills. Congrats! If you haven't already started, you are 
definitely now ready to practice at work. Let's move on to some final exercises 
designed to give you the skills and confidence you need to succeed telling stories 
with data in your day job. 





chapter nine 

practice more at work 

The final chapter of exercises focuses on how to apply the storytelling with data 
lessons at work. You've encountered a good amount of this guidance already and 
I encourage you to refer back to the practice at work exercises throughout this 
book when facing a specific project: the initial exercise in this chapter will help 
you do just that. 

Additionally, you'll find guidance for further integrating the storytelling with data 
process into your and your colleagues' day-to-day work. This will help you 
examine and practice the totality of the lessons we've covered over the course of SWD 
and this book. You'll be provided with resources and guides for facilitating group 
learning and discussion and assessment rubrics that can be used to evaluate your 
own or others' work. We'll review the important role of feedback and how to 
best give and receive it, as well as setting—and helping others set—good goals 
for continuing to improve your data storytelling skills. There is no such thing as 
an "expert" in this space; regardless of skill level, there is always room for further 
growth. We can all continue to refine our abilities and become more nuanced in 
how we communicate with data. 

Awesome work completing the exercises so far (and if you haven't completed 
them all, that's okay—it means you have more to go back to!). Next, let's practice 
some more at work! 

To be helpful, let's first review some ideas for setting you and your team up for 
success. 

Exercise 9.1: create your plan of attack 

You've seen a number of practice at work exercises already in Chapters 1-6. You'll 
find all of these listed below. When facing a project where you need to 
communicate with data or present a story, scan through this list. Determine what 
combination of exercises will work best for your needs, then complete them! 

1.17 get to know your audience 

1.18 narrow your audience 

1.19 identify the action 

1.20 complete the Big Idea worksheet 

1.21 solicit feedback on your Big Idea 

1.22 create the Big Idea as a team 

1.23 get the ideas out of your head! 

1.24 organize your ideas in a storyboard 

1.25 solicit feedback on your storyboard 

2.17 draw it! 

2.18 iterate in your tool 

2.19 consider these questions 

2.20 say it out loud 

2.21 solicit feedback on your graph 

2.22 build a data viz library 

2.23 explore additional resources 

3.11 start with a blank piece of paper 

3.12 query: do you need that? 

4.9 test: where are your eyes drawn? 

4.10 practice differentiating in your tool 

4.11 figure out where to focus 

5.9 make data accessible with words 

5.10 create visual hierarchy 

5.11 pay attention to detail! 

5.12 design more accessibly 

5.13 garner acceptance for your designs 

6.12 form a pithy, repeatable phrase 

6.13 answer: what's the story? 

6.14 employ the narrative arc 

Exercise 9.2: set good goals 

I am a huge proponent of goal setting. When you articulate something you would 
like to happen and then plan the steps you can take to make it so, that thing is 
simply more likely to be accomplished! Setting good goals is one way to help 
ensure ongoing focus on further developing and honing your data storytelling skills. 

At its core, the way to do this is simple. Isolate the skill or aspect of your work you 
want to advance. Then list the specific actions you can undertake to do it. Create 
a sense of urgency by making the actions time sensitive. Post this list somewhere 
you can see it for a regular reminder. Share with a manager or colleague to 
create additional accountability. If you're like me, the feeling of being able to check 
off completed things that lead you closer to your goal is super gratifying. Even 
more powerful is how these actions help you refine your skills and increase your 
expertise as you accomplish your initial goals and set increasingly ambitious ones. 

If you crave a more specific structure for goal setting, I'll walk you through one 
momentarily. First, a caveat: if you have a current process that works, I 
encourage you to continue to use it. At storytelling with data, I set annual big-picture 
goals for the company. On a quarterly basis, all individuals (including me) follow 
a goal-setting and assessment framework that I learned at Google. I'll outline our 
process in case it is useful as you set goals or help your team set them. 

We document and measure our quarterly Objectives and Key Results (OKRs) to 
maintain focus and accountability on goals that support the business. The 
objectives define what the individual wants to accomplish. These should be significant 
and communicate action. The key results describe how the given objective will be 
met. Key results should be aggressive yet realistic, measurable, limited in number, 
and time-bound (with target frequency or completion date). Individuals typically 
have 3-5 objectives for the quarter, each supported with 2-3 key results. For 
illustration, here is an example objective and associated key results: 

OBJECTIVE: Thoughtfully integrate story into my presentation of pilot program XYZ, 
gaining approval for the resources needed to officialize and expand the program. 

• KEY RESULT 1: Complete SWD: let's practice! exercises 1.17, 1.20, 1.21, 
1.23, 1.24, 6.12, and 6.14 for two different projects by Jan 31st. 

• KEY RESULT 2: Plan and create separate materials by Jan 15th optimized for 
the given setting: live presentation and email summary. Solicit and 
incorporate feedback from Key Stakeholder A by Jan 31st. 

• KEY RESULT 3: Give three practice presentations this quarter to colleagues 
and integrate their feedback to improve my content, flow, and delivery style. 

Once an individual's OKRs are finalized, we publish and make them available to 
the broader team. This sort of transparency around what we're each trying to 
accomplish increases everyone's odds for success. 

About a week after each quarter, we review and grade the past quarter's OKRs. 
The reflection piece is helpful both to pause and celebrate successes as well as 
evaluate where not as much progress as planned was made. Grading—one of the 
most important steps in OKRs—takes this part of the process a step further. We 
rate ours on a simple 0-10 scale, with zero indicating no progress and 10 
indicating the key result was completely achieved (for example, if the key result was to 
create 12 graphs with a new tool you're learning and you created 12, you score 
10; if you created 6, you score 5, and so on). 

I find assigning a number helps us each be honest with ourselves and ensures 
adequate accountability. It's easy to say, "I could have done more." But when 
I score myself a zero or two (for example), that causes a different level of 
introspection: why didn't I do more? Did priorities change and is that okay? Or if not, 
what's keeping me from doing it? How do I change that in the future? I have these 
conversations with myself (and my husband, who also used to work at Google and 
helps keep me accountable!) about my own OKRs and with the individuals on my 
team about theirs. We aggregate each objective score by averaging the scores of 
the key results for that objective. Averaging the respective objective scores gives 
a summary for the overall quarter. In sum, reviewing the prior quarter's OKRs and 
corresponding scores helps drive really useful conversations around where things 
are going well, competing priorities, challenges, and potential solutions. All of 
that then feeds into the next OKR setting process for the current quarter and we 
do it all over again. 
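
To make the arithmetic concrete, here is a minimal Python sketch of the grading described above: each key result is scored from 0 to 10 in proportion to how much of its target was achieved, an objective's score is the average of its key result scores, and the quarter's summary is the average of the objective scores. The function names are hypothetical, and capping a score at 10 when a target is exceeded is an assumption; the process described here does not say how overachievement is handled.

```python
from statistics import mean


def key_result_score(achieved: float, target: float) -> float:
    """Score a key result on the 0-10 scale as the share of the target met."""
    if target <= 0:
        raise ValueError("target must be positive")
    # Assumption: exceeding the target still scores 10, never more.
    return round(10 * min(achieved / target, 1.0), 1)


def objective_score(key_result_scores: list[float]) -> float:
    """Average the key result scores that support one objective."""
    return mean(key_result_scores)


def quarter_score(objectives: dict[str, list[float]]) -> float:
    """Average the objective scores to summarize the whole quarter."""
    return mean(objective_score(scores) for scores in objectives.values())


# The example from the text: 12 graphs planned with a new tool.
print(key_result_score(12, 12))  # all 12 created
print(key_result_score(6, 12))   # only 6 created

# A made-up quarter with two objectives (scores are illustrative only).
quarter = {
    "integrate story into pilot presentation": [10, 5, 8],
    "learn a new visualization tool": [5, 0],
}
print(round(quarter_score(quarter), 1))
```

Because the quarter summary averages objective scores rather than pooling all key results, an objective with many key results carries the same weight as one with few.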

I credit the OKR process as having helped me personally continue to improve my 
skills and create and expand a successful business. I appreciate the disciplined 
thinking it enforces and the way it officializes accountability. It establishes 
indicators for measuring progress: at any point, we each know how far along we are, 
where we're missing what we set out to do, and where we're being successful. As 
my team has grown, it also helps ensure everyone knows what is important and 
can align their individual objectives so we are all working towards the same goal. 

Your turn! What is a specific goal you have related to cultivating your skills for 
effectively visualizing or communicating with data? Write it down. Next, identify 2-3 
key results that will help you achieve this objective. Discuss it with your manager. 
Post it somewhere you will regularly see it. Congrats, you've just written your first 
OKR. Next, complete it! 

For more on goal setting in general and the OKR process in particular, check out 
Episode 13 of the storytelling with data podcast (storytellingwithdata.com/podcast), 
which focuses on goal setting. 

Exercise 9.3: give & receive effective feedback 

Getting feedback and iterating is an incredibly important part of the process for 
evolving our skills. We all know this. Yet it can be understandably difficult to open 
yourself up to critique. When receiving feedback, it's easy to become defensive 
rather than really listen and absorb. Here are a few thoughts related to this to help 
you get—and give—more effective feedback. The next time you find yourself in 
need of input, refer to the following. 

Determine who to ask. Spend some time thinking about who will be well 
positioned to give you feedback based on your specific needs. We often think first 
of someone who is familiar with the situation, but pause to reflect on what type 
of feedback will serve you best. Lack of context can be a good thing because it 
ensures a totally fresh perspective. This can be especially useful if your audience 
isn't close to the work you are doing, as it will help point out inaccessible 
language, assumptions you may unknowingly be making, unfamiliar types of 
visualizations, or other issues that could inhibit successful communication. That said, 
expert feedback is sometimes warranted. This can be useful, for example, when 
facing a technical audience and needing to make sure you're well prepared for 
anticipated scrutiny, or when some level of context is necessary in order to offer 
useful feedback. 

Time it right. This temporal aspect is important both for you as well as for the 
people you ask to spend time providing feedback. When it comes to timing for 
you—the earlier in the process you can get input, often the better. At this point 
you've put in less effort, which means you're less attached to a particular path or 
output and can more easily change directions. Particularly if you have strongly 
opinionated stakeholders, starting the process of feedback early on can help 
reduce iterations over the course of a project. That said, there will be different aspects 
of the work that people can better evaluate in a completed form compared to early 
stage (something mocked up or hand-drawn, for example), so getting feedback 
at multiple points in the process can be helpful for highly critical projects. When it 
comes to timing from the critiquer's standpoint: be respectful of their schedule and 
try to time your requests for feedback at points that are convenient for them. If you 
find it difficult to get someone's feedback when you need it, set up time to do this 
live, where you can review and get the needed input over the course of a meeting. 
Be appreciative towards those who provide you feedback. 

Be clear on the focus. Do you want to understand if the graph is easy to read, if 
your point comes across efficiently, or something else? Be as specific as you can 
about where you need input so that you get useful feedback in the right places. 
Consider as part of this whether it will be helpful to share the context of who 
the audience is, what they care about, or what knowledge they have. It can also 
be useful to convey what constraints you faced as part of the process—or what 
constraints you will face for incorporating feedback. For example, if it's late in the 
game and you just need a second opinion on whether something works, you can 
make this clear: "Tell me what you think of this. I have to send it out today, so I'm 
looking for anything that's unclear, or issues I can hopefully resolve quickly so I can 
meet my 5 p.m. deadline. Given that, what reactions can you share?" On the 
other hand, if you have ample time, you can be broad in your request for feedback: 
"Really, everything is fair game. I'm trying to understand broadly what's working 
well and how I might further refine." 

Listen, don't talk. When someone raises a point of constructive criticism, it is 
natural to want to respond and justify your decision with all the reasons you 
approached something the way you did. Refrain from doing this, as it may shut 
down the conversation. Instead, listen with an open mind and without judging 
the feedback. Acknowledge that you've heard. Take notes. Encourage the person 
giving the feedback to keep talking. Ask probing questions to better understand 
this alternative point of view. If needed to help drive the conversation, refer to the 
following questions. 

After listening: ask questions. If you need more feedback after you've listened, 
use the following discussion prompters. 

• Where do your eyes go first on this page? 

• What is the main takeaway or message? 

• Talk me through how you process the graph: what do you pay attention to 
first? Next? 

• Are there things you would have done differently? Why? 

• Is anything distracting from the message? 

• Set it aside. Can you tell me the main point or story? What else do you remember? 

Weigh the input you receive. Not all feedback is equal and you will sometimes 
receive bad advice. Who it is coming from will typically be the dictating factor 
for whether it must be followed or can be ignored. That said, if you find you're 
being met with resistance, step back and really try to take any feelings or attachment 
to what you've created out of it to figure out what's not working. If you're 
unsure about the feedback you've received, seek another opinion. If this backs 
up the first, don't assume the issue sits with those providing feedback—assume 
it's something about the design. Take time to understand this so you can identify 
and address the root issue. 

Provide good feedback to others. Getting good at giving feedback can have 
positive benefits for framing the feedback you'd like from others. It can also help 
you sharpen your thinking in ways that ultimately improve your own work. Refrain 
from dogmatically identifying aspects as right or wrong or good or bad. When 
giving feedback, be thoughtful about how you frame it: the person likely spent 
time creating what they are sharing with you and they are putting themselves in 
a position of vulnerability when they ask for feedback. They also probably faced 
constraints to which you don't have visibility. Ask them to be clear on what feedback 
they want so you can direct your comments accordingly. Always make the 
feedback about the work, not the person. Before offering ideas for changes, point 
out what is executed well. On this note, I've heard of one team whose approach for 
giving each other feedback is to complete the following sentences: "I like...," 
"I have questions about...," and finally, "I'd suggest..." Another similar feedback 
framework is analyze-discuss-suggest, where you start by analyzing the graph or 
slide or presentation. After you've spent time doing so, discuss. Only after all of 
that do you make suggestions. 

Impromptu input can be helpful, too. While sometimes the process of getting 
feedback is formalized—for example, one of the ideas in forthcoming Exercise 
9.4 is to organize a group feedback session—that won't always be the case. When 
you find yourself wondering whether something works, print it out or have a colleague 
peer over your shoulder at your computer and share their thoughts. You 
can pull ideas from this exercise in these more impromptu scenarios, too. Get 
input from others to iterate and refine your work from good to great! 

Beyond seeking feedback for your own work, there are steps that can be taken to 
help build a culture where feedback is part of the norm for your broader team or 
organization. We'll talk more about this in the next exercise. 

Exercise 9.4: cultivate a feedback culture 

As we've discussed, getting input from others is hugely important to understand 
what is working and when further iteration is warranted as you hone your data 
storytelling skills. Creating an open culture where feedback is part of the norm 
across your team or organization is a critical part of this, and it often takes 
intentional steps to cultivate. 

Simply telling people they need to solicit or provide feedback likely isn't enough 
to create the right sort of enriching environment. At work, stakes are often high. 
This sometimes causes individuals to be hesitant to admit they need feedback. 
If the culture isn't right or the feedback isn't delivered well, it can be taken as a 
personal attack rather than constructive criticism—which can be more detrimental 
than beneficial. That said, if you find yourself or your team facing challenges in 
this area, there are plenty of things you can do to shift the culture and help people 
practice and develop confidence when it comes to giving and receiving feedback. 
Here are a few ideas: 

• Introduce "present & discuss" time in your regular team meeting. Reserve 
ten minutes of a recurring team meeting for a team member to present something 
they are working on or have recently completed. Then have a conversation 
where each person shares a positive point about the work and a suggestion 
for further improvement. Rotate who shares each time. 

• Assign "feedback buddies." Pair up people on your team (or across teams) 
and set expectations about the points in a project at which people should seek 
and provide feedback, or with what frequency over a given time period (for 
example, twice weekly over a 1-month period). Managers can help hold people 
accountable by asking about the feedback received and incorporated 
during one-on-ones or project updates. After a predetermined amount of 
time (a month or a quarter), shuffle the partners. This can help forge stronger 
relationships among your team or across teams and also integrate regular 
feedback into the process. 

• Hold a feedback "speed dating" session. Invite those who have a specific 
graph or slide on which they'd like feedback. Instruct them to print out a copy 
and bring it with them to the session. Organize tables in opposing rows so 
that individuals face one another in pairs. Identify a timekeeper with a loud 
voice to manage the session. At the first "Go!," have each duo exchange the 
printed work product they brought and allow one minute to quietly review. 
Each person should then spend two minutes asking questions and giving suggestions 
(with the timekeeper watching the clock, alerting everyone when to 
change focus). Each pair will be together for a total of five minutes (one-minute 
review + two-minute suggestions to Person A + two-minute suggestions 
to Person B). After five minutes, have those on one side of the table move 
seats by one position (the person on the end fills the now empty spot on the 
opposite side). Repeat until you run out of people or time. If your organization 
does lunch-and-learn or other less formal gatherings, this can be a fun team 
or cross-team activity to integrate. 

• Conduct a formal feedback session. Schedule an hour. Each person should 
bring a physical copy of something on which they'd like feedback (a storyboard, 
a graph, a slide, a presentation). Set expectations for the session and 
share tips for giving effective feedback (see Exercise 9.3). Divide people into 
groups of three. Within these triads, each person should spend five minutes 
lending context to what they've brought to share and the specific feedback 
they seek, followed by ten minutes of group discussion and suggestions. Rotate 
so that each person in the group has an opportunity to share their work 
and get feedback (for a total of three 15-minute segments, one each focused 
on Person A, Person B, and Person C's work). End with a full-group debrief on 
what worked well, whether you'll do it again, and what you would do differently 
next time. This can be a standalone session or also works well as part of 
a team offsite. 
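
The seat rotation in the feedback "speed dating" idea above is close to a round-robin schedule. The following is a minimal sketch, not a prescribed method: it assumes the common "circle" approach, in which one person keeps their seat and everyone else shifts by one position each round, so that no pair meets twice. The function name and the sample names are made up for illustration, and the five minutes per round comes from the timing described above.

```python
from collections import deque


def speed_dating_rounds(people):
    """Yield one list of (person, person) pairs per round (circle method)."""
    seats = list(people)
    if len(seats) % 2:
        seats.append(None)  # odd head count: one person sits out each round
    fixed, rotating = seats[0], deque(seats[1:])
    for _ in range(len(seats) - 1):
        row = [fixed, *rotating]
        half = len(row) // 2
        # Seat i on one side of the table faces seat -(i + 1) on the other.
        pairs = [(row[i], row[-(i + 1)]) for i in range(half)]
        yield [pair for pair in pairs if None not in pair]
        rotating.rotate(1)  # everyone except the fixed seat moves by one


MINUTES_PER_ROUND = 5  # 1-minute review + 2 minutes of feedback each way

team = ["Ana", "Ben", "Cai", "Dee", "Eli", "Fay"]  # hypothetical names
for number, pairs in enumerate(speed_dating_rounds(team), start=1):
    print(f"Round {number} (starts at minute {(number - 1) * MINUTES_PER_ROUND}):", pairs)
```

With an even number of people, everyone has met everyone else after one fewer round than there are participants, which gives a quick way to estimate how long a full session would run.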

It can also be useful to create non-work-centric forums for presenting and exchanging 
feedback. This takes away some of the pressure, making it easier to 
give and receive critique. As people become practiced exchanging feedback in a 
low-risk setting, they start to build a habit that can better enable them in the work 
environment. This can be a particularly smart approach if you're trying to change 
the culture in an environment that hasn't historically been conducive to open 
feedback. Here are a couple of ideas related to this: 

• Introduce "review & critique" time in your regular team meeting. This is 
similar to the "present & discuss" idea raised previously, but focuses on publicly 
available examples rather than work-specific ones. This takes pride and 
potential for taking criticism personally off the table entirely. Assign a team 
member ahead of time to source a graph, slide, or data visualization from the 
wild (for example, the media). Spend a few minutes talking through it, then 
have each person share a positive point about the work and a suggestion 
for what they might do differently. Rotate who picks the example each time. 
There can be a tendency to look for non-effective examples, which is fine, but 
evaluating good examples in this manner can make for productive conversations 
and help people identify finer points of feedback, too. 

• Create a team monthly #SWDchallenge. See Exercise 2.16 or 
storytellingwithdata.com/SWDchallenge for more on our monthly challenges. As a team, 
you can participate in a live challenge, pick one from the archives, or create 
your own. In the first week of the month, set the specific challenge and encourage 
participants to identify non-work data of interest. Over the course of 
the rest of the month, individuals or partners should create their respective 
data visualizations. At the end of the month, schedule time in person or virtually 
and invite those who participated. Have each person or duo present their 
creation and get feedback from others. This idea is inspired by Simon Beaumont 
(Global Director of Business Intelligence at Jones Lang LaSalle), who 
has been doing something similar with his team. Simon observed that, over 
time, this has led to both improved data visualization and more productive 
feedback exchange among his team in general. Additionally, they record their 
webinar feedback sessions and make them broadly available so others at the 
organization can watch and learn as well. 


If your team could benefit from cultivating a culture of feedback, step back and 
think about how you can best do that and whether one of these ideas might help. 
Take liberties to design something that will work well for your team based on the 
environment. Learn and iterate and determine how to evolve the ways in which 
you facilitate feedback over time. Doing this well can help everyone hone their 
skills and create better data communications! 

Exercise 9.5: refer to the SWD process 

Over the course of this book, we've covered six lessons to set you up for successful 
data communications. It can be helpful to revisit these with a specific project 
in mind. When you find yourself needing to communicate with data, read the 
following for a reminder of the main lessons we've covered and some thought 
starters to reflect on for the project you face (each of these corresponds to a 
chapter of the same number). Refer back to the practice with Cole solutions within 
each chapter for illustrations and examples and the practice at work exercises for 
additional guidance on applying the various lessons. 

(1) Understand the context. Who is your audience? What motivates them? What 
do you want to communicate to your audience? Articulate your Big Idea. The Big 
Idea has three components: it (1) articulates your point of view, (2) conveys what's 
at stake, and (3) is a complete sentence. Create a storyboard of the components 
you'll cover with your audience to help them understand the situation and convince 
them to act. Determine what order will work best; arrange sticky notes to 
create the desired narrative flow. You now have a plan of attack to follow. Get 
client or stakeholder input at this point if possible. 

(2) Choose an appropriate visual. What do you want to communicate? Identify 
your point and how you can show your data in a way that will be easy for your 
audience to understand. This often means iterating and looking at your data a 
number of different ways to find the graph that will help you create that magical 
"ah ha" moment. Draw it! Consider what tools and other resources you have at 
your disposal to realize your drawing and then create it. Ask for feedback from 
others to learn whether your visual is serving its intended purpose or to get 
pointers on where to iterate. 

(3) Eliminate clutter. Is there anything that isn't adding value? Identify unnecessary 
elements and remove them. Reduce cognitive burden by visually connecting 
related things, maintaining white space, cleanly aligning elements, and avoiding 
diagonal components. Use visual contrast sparingly and strategically: don't let 
your message get lost in the clutter! 

(4) Focus attention. Where do you want your audience to look? Determine how 
you can draw your audience's attention to what you want them to see through 
position, size, and color. Use color sparingly and strategically, considering tone, 
brand, and colorblindness. Employ the "Where are your eyes drawn?" test to 
understand whether you're using preattentive attributes to effectively direct your 
audience's attention. 

(5) Think like a designer. Words help data make sense. Clearly title and label 
graphs and axes and employ a takeaway title to answer the question, "So what?" 
Create visual hierarchy of elements to ease the processing and make it clear how 
to interact with your visual communications. Pay attention to details: don't let 
minor issues distract from the credibility of your message. Make your visual designs 
accessible. Spend time on the finer details of your design: your audience will appreciate 
it, heightening the odds for successful communication. 

(6) Tell a story. Refer back to your Big Idea: create a pithy, repeatable phrase 
from it. Revisit your storyboard and arrange the components of your story along 
the narrative arc. What is the tension? How can your audience act to resolve it? 
Where and how does data fit into the narrative? How will your materials for a live 
presentation vary from those that are sent out to be consumed on their own? Create 
a data story that captures your audience's attention, drives a robust discussion, 
and influences action! 

Want to hang the preceding list at your desk for easy reference? A downloadable version 
can be found at storytellingwithdata.com/letspractice/downloads/SWDprocess. 

How do you know if you've applied these lessons well? Refer to Exercise 9.6 for 
an assessment rubric that can be used for this. 


Exercise 9.6: make use of an assessment rubric 

I don't tend to use rubrics in the context of visualizing and communicating data. 
People like rules, and rubrics are too easy to turn into a formulaic approach when 
more nuanced thinking is warranted. That said, I understand the desire to have a 
way to assess the effectiveness of your own or others' work. Consider the following 
framework to be a starting point to address this need. 

I'm intentionally not going to be prescriptive or formulaic in how this should be 
used. I'll outline a couple of options, but I encourage you to give thought to what 
makes sense given the specifics of your situation. Are you a manager giving feedback 
to an individual or multiple individuals on your team? An instructor needing 
to grade assignments? Or an individual wanting to judge your own work? 

For those simply seeking a structured way to assess your own or others' work, you 
can use the following as a checklist, or apply some simple labels (I personally like 
the three-category scale of "nailed it!," "good," and "more attention needed"; 
you can also sometimes have a fourth category of "not applicable"). A numeric 
score will be helpful if you need to assess across multiple individuals or see changes 
over time (for example, grading). You could use a simple 1-3 scale that aligns 
with the descriptors above, or a 1-10 scale if you crave a finer level of detail or 
find this more intuitive. 

Component | Assessment 
I understand how to read the graph(s). | 
The choice of visual(s) makes sense given the data and what is being communicated. | 
An appropriate amount of context is present about the data/methodology/background. | 
Words are used well to title, label, annotate, and explain. | 
Visual clutter is minimized/absent. | 
It is clear where I should focus first. | 
Color is used effectively. | 
The communication is free of misspellings, grammar mistakes, and math errors. | 
The overall design is well structured: elements are aligned, white space is used well. | 
The order of content makes sense. | 
The main message and/or call to action is clear. | 
Materials are optimized for how the content will be delivered. | 
Overall success: the communication created solves the need it set out to. | 
 | 
 | 

FIGURE 9.6 Example assessment rubric 

The preceding list should be adjusted given what you are assessing. Is it a graph? 
A slide? An entire presentation? There might be some components that don't 
make sense for your scenario. There could be others you should add to have the 
full picture. I invite you to modify the rubric to best meet your needs. The final 
lines are intentionally blank to encourage you to think through what additional 
components may make sense given the specifics of what you are trying to assess. 
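
If you opt for a numeric score, the tallying can be kept very simple. The sketch below is one possible approach under stated assumptions: it maps the three labels mentioned earlier to a 1-3 scale (with 3 as the best score, which is my assumption about the direction of the scale), skips anything marked "not applicable," and averages the rest. The function name and the sample assessment are invented for illustration.

```python
# Assumed mapping: higher is better. Adjust to whatever scale you adopt.
SCALE = {"nailed it!": 3, "good": 2, "more attention needed": 1}


def rubric_score(assessments):
    """Average the 1-3 scores, ignoring components marked 'not applicable'."""
    scores = [
        SCALE[label]
        for label in assessments.values()
        if label != "not applicable"
    ]
    return sum(scores) / len(scores) if scores else None


# Hypothetical assessment of a single graph.
example = {
    "I understand how to read the graph(s).": "nailed it!",
    "Visual clutter is minimized/absent.": "good",
    "Color is used effectively.": "more attention needed",
    "Materials are optimized for how the content will be delivered.": "not applicable",
}

print(rubric_score(example))  # average of 3, 2, and 1
```

An average is only one way to summarize. If a single weak component should not be hidden by strong ones, reporting the lowest score alongside the average may be more useful.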

I'll close this exercise by pointing out that there are a number of intangibles that 
are harder to evaluate in a structure like this. These are little things that sum up 
to create a good or not-so-great experience. There is something about the manner 
in which this is achieved that plays into overall success as well. For example, 
one aspect to consider is whether and how time was optimized given the relative 
importance of what needs to be accomplished. You do not need to apply the 
entire storytelling with data process every time you touch data—be smart about 
where and how you apply the various lessons we've covered for maximum benefit 
with minimal additional work. That sort of efficiency and prioritization should be 
recognized. 

Use Exercise 9.5, which outlined the SWD process, as you work through a current 
project. Once finished, run through this rubric as a final assessment to ensure all 
components were addressed. 

You can download this rubric and adjust for your own needs at 
storytellingwithdata.com/letspractice/downloads/rubric. 


Exercise 9.7: facilitate a Big Idea practice session 

Before spending time visualizing data or creating content, pause to understand 
the context, consider your audience, and craft your message. Devoting thought 
to these important aspects can yield massive payback in being able to better 
meet your audience's needs, get your message across, and drive the action you 
seek. One way to encourage and kickstart this process across a team is to organize 
and conduct a Big Idea practice session. 

This guide should give you what you need to introduce the concept of the Big 
Idea and facilitate an exercise with individual, partner, and group discussion components. 
The overarching goal is to help participants practice articulating the Big 
Idea and giving and receiving feedback to refine. 

Prep work: what to do ahead of time 

Read through this guide. Also review the Big Idea-related exercises in Chapter 
1. Explain the Big Idea to someone else and have them ask you questions. The 
resulting conversation will help you get comfortable talking about this concept, 
which will better prepare you for facilitating it with a group (do this a couple times 
with different people if possible!). 

When it comes to the logistics for the session, decide who will attend. Book a 
room and send a calendar invite for 60 minutes. It's ideal if everyone can attend 
in person (if not possible, pair those joining remotely and have everyone tune 
into the main room for the introduction and debrief). Print a copy of the Big Idea 
worksheet per person (you can make copies from Exercise 1.20, or download from 
storytellingwithdata.com/letspractice/downloads/bigidea). If people tend to use 
laptops for everything, grab a handful of pens: this exercise is best done in a low-tech 
manner (encourage people to leave their laptops behind!). 

Example agenda (HH:MM) 

00:00 - 00:10 Introduce the Big Idea, talk through an example 
00:10 - 00:20 Hands-on exercise (Big Idea worksheet) 
00:20 - 00:30 First partner discussion 
00:30 - 00:40 Second partner discussion 
00:40 - 01:00 Group discussion 

A scenario to introduce the Big Idea 

Introduce the Big Idea by presenting the three components—remember, the Big 
Idea should: 

1. Articulate your unique point of view, 

2. Convey what's at stake, and 

3. Be a complete sentence. 

To illustrate, introduce the following scenario (excerpted from storytelling with 
data, Wiley, 2015), then demonstrate the example Big Idea. Alternatively, you can 
use a scenario and corresponding Big Idea from one of the practice with Cole 
exercises in Chapter 1 or create your own. 

SCENARIO: A group of us in the science department were brainstorming about 
how to resolve an ongoing issue we have with incoming fourth-graders. It seems 
that when kids get to their first science class, they come in with this attitude that 
it's going to be difficult and they aren't going to like it. It takes a good amount of 
time at the beginning of the school year to get beyond that. So we thought, what 
if we try to give kids exposure to science sooner? Can we influence their perception? 
We piloted a learning program last summer aimed at doing just that. We invited 
elementary school students and ended up with a large group of second- and 
third-graders. Our goal was to give them earlier exposure to science in hopes of 
forming positive perception. To test whether we were successful, we surveyed the 
students before and after the program. We found that, going into the program, 
the biggest segment of students, 40%, felt just okay about science, whereas after 
the program, most of these shifted into positive perceptions, with nearly 70% of 
students experiencing some level of interest towards science. We feel that this 
demonstrates the success of the program and that we should not only continue 
to offer it, but also expand our reach with it going forward. 

Let's imagine that we are communicating to the budget committee that has control 
over the funds that we need to continue our program. The Big Idea could be: 
The pilot summer learning program was successful at improving students' perception 
of science; please approve our budget to continue this important program. 

This Big Idea: 

1. Articulates our point of view (we should continue this important program), 

2. Conveys what's at stake (improved student perception of science), and 

3. Is a complete (and single!) sentence. 

After introducing the Big Idea and talking through an example, it's time to turn it 
over to participants to practice. 

Hands-on exercise: the Big Idea worksheet 

Ask each participant to identify a project. This can be any example where they 
need to communicate something to an audience (something they can talk about 
openly, as they will be sharing with others). Pass out the Big Idea worksheet and 
ask participants to work through it for the project they have identified. 

Allow about 10 minutes for this. As participants are working their way through the 
Big Idea worksheet, wander the room to monitor progress and answer any questions. 
After approximately 10 minutes or when you see that nearly everyone has 
completed writing their Big Idea (and hopefully every person has at least started), 
you can begin the partner discussion. 

Partner discussion 

It's okay if not everyone is totally done with their Big Idea, as they will still have a 
chance to refine as they confer with partners. Next, ask them to partner up and 
take turns sharing their Big Idea and giving each other feedback. I usually give a 
couple of specific directions related to this: 

• If there are people in the room who are more familiar with what you're going 
to be talking about and others who are less so, partner first with someone 
who is less familiar. If this requires standing up and moving around the room, 
please do so. 

• Receiving partner, your job is very important. Your job is to ask the person 
reading their Big Idea a ton of questions, helping them get clear and concise 
on their message. 

Allow 10 minutes for this initial partner discussion. Wander the room to answer 
any questions that arise. After about 5 minutes, check in with each group to make 
sure they've moved on to the second person so both partners have a chance to 
share and get feedback. 

After about 10 minutes, direct participants to switch and partner up with a new 
person and repeat the process of taking turns to share and receive feedback. 
Again, check in with each group at the five-minute mark to make sure people are shifting 
to the second person so both people in each partner group have a chance to 
share and receive feedback. Allow 10 minutes for this second iteration of partner 
feedback. Then direct participants to come back together for group discussion. 

Facilitate the group discussion 

The group discussion that you guide after the individual and partner work is important 
for reinforcing content and helping address any questions or challenges 
that may have arisen as part of the Big Idea exercise. 

The following are questions to spark ideas (pause after each of these and let the 
conversation take its natural course, helping to reinforce the main points below): 

• Did you find this exercise easy or challenging? 

• What was hard about this exercise? 

• How did you get it down to a single sentence? 

• Show of hands: how many people found partner feedback to be helpful? 

• What was helpful about partner feedback? 

Points to make as part of the conversation: 

• Getting it down to a single sentence is hard. Concision is surprisingly difficult, 
especially when it's work that we are close to—it's hard to let go of all those 
details! The Big Idea won't be the only thing you communicate; rather, supporting 
content will come into play. 

• There are various strategies that can help you get it down to a single sentence. 
It can sometimes be useful to write a few sentences first, then trim. The 
Big Idea worksheet can also be helpful, since it breaks each component apart 
so you can deal with them one at a time. By the time you get to the end, you 
have the pieces and it's like a puzzle that you can work to put together in a 
way that makes sense. 

• Getting it down to a single sentence is important. The sentence restriction is 
arbitrary, but it is purposefully short. This forces you to let go of most of the 
details. You probably also have to do some wordsmithing. This is important: 
clarity of thought happens through this wordsmithing process. 

• Saying it out loud is helpful. When we say things out loud, it ignites a different 
part of our brain as we hear ourselves. If you find yourself tripping up when 
reading your Big Idea, or something just doesn't sound right, these can be 
pointers on where to iterate. Because of this, there is a benefit to saying the 
Big Idea out loud, even if it's to yourself in an empty room. Even better if 
someone is present to react to what you've said, which brings us to the next 
point. 

• Partner feedback is critical. As we get close to our work, we develop tacit 
knowledge: things that we know that we forget others don't know (specialized 
language, assumptions, or things we take as given). Talking to a partner can 
be a great way to identify these issues and adjust as needed. The dialogue 
you have with a partner helps you get solid on the point you want to make 
and find the words that will help you make it clearly. 

• The partner need not have any prior context. It can be helpful if your partner 
doesn't have any context, because of the kinds of questions this will prompt 
them to ask. Simple questions like "Why?" can be very useful, both because 
they help point out something that is obvious to us but not to someone else 
and also because of the logic they force us to articulate when it comes to answering. 
Your audience will never be as close to your work as you are, so soliciting 
feedback from someone less familiar can be really useful for identifying 
the right words to make your overall point accessible and understandable. 

• Clearly articulating the Big Idea makes creating your communication easier. 
If you can't clearly get your point across in a single sentence, how will you put 
together a slide deck or report that will do so? Too often, we go straight to 
our tools and start building content, without having a clear goal in mind. The 
Big Idea is that clear goal—the guiding North Star, directing the process of 
generating supporting content. Once it's been formulated, it is the built-in 
litmus test for any bit of content up for consideration for inclusion: does this 
help me get my Big Idea across? 

Best of luck facilitating the Big Idea session! 

Exercise 9.8: conduct an SWD working session 

I often facilitate working sessions with teams after conducting a storytelling with 
data workshop. I never cease to be amazed by the amount of progress people 
are able to make with a few low-tech tools and some dedicated time. Grab some 
colleagues, read SWD or this book, then use the guide outlined on the following 
pages to run your own storytelling with data working session. 

Prep work: what to do ahead of time 

Send a calendar invite for three hours to your team and book a conference room 
with plenty of table space and whiteboards. Stock up on supplies: colored markers, 
flip-charts, and multiple sizes of sticky notes (the 6x8-inch ones are awesome, 
as they are the same dimensions as a standard slide and can be used to mock up 
an entire presentation in a low-tech manner; also have some smaller ones on hand 
for those who may want to do higher level storyboarding and focus on general 
topics and flow before getting into the details). 

Use the following instructions in combination with the storytelling with data process 
in Exercise 9.5 to organize a working session where everyone can have time 
and space to put lessons into practice, present, and receive feedback. The example 
agenda that follows works best for groups of 8-10 (enough time for everyone 
to present and give/receive feedback), but can be expanded for larger groups by 
adding more time to the present back portion (plan about 6-7 minutes per person 
or group). Instructions for participants are shown on the following pages; a downloadable 
version to print can be found at 
storytellingwithdata.com/letspractice/downloads/SWDworkingsession. 
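
The sizing guidance above (about 6-7 minutes per person or group) can be turned into a quick calculation when planning for a larger group. The sketch below is illustrative only: the segment lengths are taken from this exercise's example agenda, while the function names, the default of 6 minutes each, and the 60-minute floor for the present back are my assumptions.

```python
def fmt(minutes):
    """Format a minute count as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def working_session_agenda(presenters, minutes_each=6):
    """Return (start, end, label) rows, sizing 'Present back' to the group."""
    segments = [
        ("Recap lessons, discussion/Q&A, set expectations for the session", 15),
        ("Project work time", 75),
        ("Take a break!", 15),
        # Assumption: never shrink the present back below the original hour.
        ("Present back", max(60, presenters * minutes_each)),
        ("Debrief, discussion/Q&A, wrap up", 15),
    ]
    rows, clock = [], 0
    for label, length in segments:
        rows.append((fmt(clock), fmt(clock + length), label))
        clock += length
    return rows


for start, end, label in working_session_agenda(presenters=15):
    print(f"{start} - {end} {label}")
```

For a group of up to ten this reproduces the three-hour example agenda; for larger groups only the present back and the segments after it shift later.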


For the actual session, nominate someone to be timekeeper. They should keep 
an eye on the clock during project work time and alert participants when they are 
halfway through and when 20 minutes remain to ensure everyone has a chance to 
prepare low-tech content to present. During the present back, they should watch 
the clock and move discussion along as needed to make sure everyone has time 
to share and get feedback from the group. 


Example agenda (HH:MM) 

00:00 - 00:15 Recap lessons, discussion/Q&A, set expectations for the session 
00:15 - 01:30 Project work time 
01:30 - 01:45 Take a break! 
01:45 - 02:45 Present back 
02:45 - 03:00 Debrief, discussion/Q&A, wrap up 


The following pages contain instructions for participants. 

Project work: how to focus your time 

Choose a project to focus on today. This can be on your own or in a group. Read 
through the storytelling with data process. Determine how you'd like to spend 
the next 75 minutes when it comes to putting one or more of the storytelling with 
data lessons into practice. Spend time sketching this out. 

Here are some ideas of how you might use your time: 

Lesson 1: understand the context 
Articulate the Big Idea or craft a storyboard. 

Lesson 2: choose an appropriate visual 

Draw various views of your data and identify which will enable you to best 
make your point. 

Lesson 3: eliminate clutter 

Is there anything that isn't adding value? Identify unnecessary elements and 
remove them. 

Lesson 4: focus attention 

How will you indicate to your audience where you want them to look? Plan your 
use of position, size, color, and other means of contrast to strategically direct 
your audience's attention. 

Lesson 5: think like a designer 

Final polishing will take place in your tools—your mock-up design can be 
rough. Still, give thought to how you'll organize elements to create structure 
and use words to make data accessible. 

Lesson 6: tell a story 

Sketch out the components of your story along the narrative arc. Where and 
how will data fit in? How can tension and conflict help you capture and maintain
your audience's attention? What pithy, repeatable phrase could you create
to help your message stick with your audience? 

Keep it low-tech: pens and paper are at your disposal (but please keep laptops
closed!). Use your colleagues in the room as brainstorming partners or as a source
of feedback as you go along. Be creative and have fun!

Present back: share with group 

You will have roughly five minutes to share a piece of what you've crafted or
planned with the group. We will continue to be low-tech: handwrite or draw visuals
to support what you'd like to present to the group for feedback.

Your present back should include:

1. A brief explanation of the background. This should incorporate the intended
audience, overall goal/objective, key decision(s) to be made, and what
success looks like.

2. How you've applied a storytelling with data lesson to your project. This 
could mean focusing on any of the following: your Big Idea, storyboard, 
ideas of how to best visualize the data, a comparison of how you had been 
looking at data and what changes you'll make, how you'll focus attention, or 
the overarching story you'll tell. This doesn't have to be fully executed; rather, 
well-formed ideas on what changes you'll make or the approach you plan to 
take are fine. Draw and make use of the pens/paper/stickies so everyone can 
see what you're envisioning. Frame the specific feedback you'd like from the 
group to help you continue to refine. 

Debrief: discussion and Q&A 

After everyone has had an opportunity to present and receive feedback, spend a 
few minutes discussing the following: 

• How did this session feel? Was the time spent useful? 

• What did you find most helpful? 

• Are there changes we should make if we do it again in the future? 

• What challenges do you anticipate in applying SWD lessons in our work?

• What additional steps can we take to improve how we communicate with data? 

Publicize the outcome of this session to other teams who might benefit (either by
participating in something similar themselves or by virtue of being on the receiving
end of the materials planned in your working session) and your management.
Share success stories. Identify any aspects of the session that didn't work as you'd 
anticipated, try to isolate why, and adjust. 

Do everything you can to help make folks aware and supportive of everyone's
efforts: both the dedicated time planning and willingness to try new strategies
for improving data-driven communications. Create champions who can help promote
the power of data done well. This will lead to increased recognition of this
important piece of the process, which will hopefully manifest as patience with you
and your team for the ongoing time and resources it will take to do it well.

Exercise 9.9: set yourself up for successful data stories

When it comes to using story to communicate data, there are steps you can take 
to help improve your odds of success. The following outlines some specific things 
to consider when crafting and delivering data stories. 

Try new things in low-risk places first. Don't go into your next board or exec
meeting and say, "Today, folks, I'm going to do something a little different. Today,
I'm going to tell you a story." That's not a recipe for success! Especially if anything
feels counter-cultural for your organization or markedly different from what you've
done in the past, try it out in a low-risk setting first. Learn and refine. Get
feedback. Small successes will build your confidence and credibility for making bigger
changes over time.

Order thoughtfully. For anything we want to communicate, there are typically
many options for how to order the content and there is no single correct approach.
Think about how you can organize things (whether elements in a graph,
objects on a slide, or slides in a presentation deck) in a way that will make sense
for your audience to create the overall experience you seek. Get feedback from
someone less familiar with the content as a means to assess whether the way 
you've ordered your materials is likely to work for your ultimate needs. 

Optimize materials for how you are communicating. Presenting live opens up a 
different set of opportunities for building our data stories. As we've seen through 
a number of examples, one strategy is to build visuals piece by piece for our
audience in a live setting. Pair this with a fully annotated slide or two for the version
that gets sent around so that those consuming it on their own get the same story 
that you walk through in a live progression. Give thought to the specifics of how 
you will be presenting and create materials that will serve you well. 

Anticipate how it could go wrong. How might things go off the rails? How can you 
equip yourself to deal with that if it happens? Identify and pressure test your
assumptions. Make sure you've investigated alternative hypotheses. Ask colleagues
to play devil's advocate and poke holes or play a snarky audience member.
Anticipate questions and be well prepared to answer them. The time you spend
prepping for how to respond to surprises will help you be more equipped to deal 
with them eloquently if they arise. 

Answer the question, "So what?" Never leave your audience wondering why 
they are looking at what you've put in front of them. Don't make them figure it out 
on their own. Make the purpose clear. Why are they here? What do you have to 
tell them? Why should they listen to you? Consider how you can use the various 
lessons we've covered to get your audience's attention, establish credibility, and 
lead them to a productive conversation or decision.

Be flexible. Rarely do things go exactly as planned. If you can anticipate that
things are likely to head in a direction not entirely in your control, be thoughtful
about how you organize your approach and materials to be able to deal with this.
In some circumstances, a "choose your own adventure" story may be warranted.
Demonstrating willingness to be flexible and adjust to your audience is one
fantastic way of establishing credibility and can help you turn a potential nightmare
situation into a successful one. 

Seek feedback. We've talked about feedback as you prepare your data stories, 
but it's also important to solicit feedback after you've presented. Get input from 
your audience or colleagues on what worked well and what you could adjust in 
the future to best meet their needs (and through that, your own). 

Learn from successes and failures. After each time you send off a report or
present data, pause to reflect on how it went. For successful scenarios, think about
why things worked and which aspects you can make use of in your future work. 
We can often learn even more from the cases that don't go as well. What caused 
issues? What is in your control that you can change in the future? Share success 
and failure stories so that others can learn as well. In this way, we can all help each 
other improve. 

The meta theme underlying all of these tips is to be thoughtful. Consider what 
success looks like and try to position yourself to make that happen so that the 
data stories you tell will have the impact you seek.

Exercise 9.10: let's discuss

Consider the following questions related to everything we've covered over the 
course of this book and how you'll apply them in your work. Discuss with a partner 
or group. If you've been undertaking the exercises in this book with others on your 
team (or even if you haven't!), these will make for an excellent team conversation 
on how to integrate the storytelling with data lessons into everyone's work. 

1. What is one thing you will commit to doing differently going forward? 

2. Reflect on the lessons covered in SWD and this book: (1) understand the 
context, (2) choose an appropriate visual, (3) declutter, (4) focus attention, (5) 
think like a designer, and (6) tell a story. Which lessons are most critical to do 
well in your work? Why is that? Which areas do you—or your team—need to 
develop the most? How can you do this? 

3. Consider how you will apply the various lessons we've reviewed: where could
things go wrong? How can you prepare for this or take steps to help ensure
success? What other challenges do you anticipate? How will you overcome them?

4. Are there additional resources that would be helpful for your overall success 
when it comes to effectively visualizing and communicating with data? 

5. What is your biggest takeaway from this book? How do you anticipate this will 
manifest in your day-to-day work? 

6. Where do gaps exist between how you work today and how you'd like to be 
working when it comes to data storytelling? How can you address those? 

7. Do you anticipate you will face resistance for the things you'd like to do
differently? Who do you think it will come from? What can you do to overcome it?

8. We always face constraints when we work. What limitations do you face? How 
might these impact when and how you apply the storytelling with data
lessons? How can you embrace these constraints to generate creative solutions?

9. What steps can you take to help others on your team or in your organization 
recognize the value of data storytelling and improve their skills? 

10. What specific goals will you set for yourself or your team related to the
strategies outlined in this book? How will you hold yourself (or your team)
accountable to these goals? How will you measure success?

closing words

We have practiced a great deal over the preceding nine chapters! You should feel 
well-equipped to integrate the various lessons, tips, and strategies into your work. 
That said—you are not done practicing. 

Communicating effectively with data is like trying to solve a puzzle. Pieces of the 
puzzle include a gamut of different considerations, things like: audience, context, 
data, assumptions, biases, credibility, how you'll present, the physical space, the 
printer or projector, interpersonal dynamics, and the action sought. We have to fit 
all of these things together in a way that works. To complicate things, the puzzle 
pieces are different every time! 

But it's not a jigsaw: there isn't only one way. There isn't a single design or
technique that works. This sometimes frustrates people. But it's actually a really
awesome thing. Many different approaches could work. There's no end to how you
can mix and match the lessons and strategies illustrated with your own twists to 
come up with potentially effective solutions. How fun is that? 

You've practiced. But your practice is not complete. We can all keep learning. We 
can continue to hone our data visualization design. We can become increasingly 
nuanced in how we tell stories with our data and use it to inspire others. 

And that is what I hope you will do. Put into practice what you've learned. Share it 
with others. Tell stories with your data that will influence positive change. 

My support and encouragement doesn't end here. Visit
storytellingwithdata.com/learnmore for information on the next phase of learning
and advancement from the SWD team.

Thanks for practicing with me!

idea, 13, 14, 30
Bias, in data, 357 
Big Idea, 10-19, 388 

articulating point of view using, 10, 18 
back-to-school shopping exercise for, 
13-14 

comparing and contrasting, 13 
components of, 10, 18 
conveying what's at stake using, 10, 18 
crafting and refining during planning, 
1, 10 

creating as a team, 46 
critiquing your initial ideas for, 18-19 
facilitating practice session for, 
391-395 

forming, using Big Idea worksheet, 12
forming pithy, repeatable phrase
exercise and, 278 
function of, 10 

pithy, repeatable phrase created 

from, 278, 321, 328, 343, 389
refining and reframing 

positive and negative framing 
exercises, 13-14, 30, 35 
practice on your own, 30, 35 
practice with Cole, 13-14 
revising, with feedback, 45 
single sentence summary of, 10, 18 
soliciting feedback on, 35, 45 
vaccine rate exercise for, 17-18 
Big Idea worksheet 
completing 

back-to-school shopping 
exercise, 10-12 
CFO's update exercise, 31-32 
pet adoption pilot exercise, 16-17 
university elections exercise, 33-35 
working as a team, 46 
description and function of, 10 
practice at work, 44 
practice on your own, 31-35 
practice with Cole, 10-12, 15-17 
reasons to use, 10 

sample worksheet, 11, 16, 32, 34, 44 
Bold font 

differentiating text elements using, 186 
focusing attention using, 154, 158, 167 
highlighting using, 227 
label text with, 158, 159 
line labels with, 71-72 
logos with, 216, 217 
sparing text use of, 115 
title text with, 158, 217 
visual hierarchy and, 227 
Borders 

branding graph with, 216-217 
closure principle and, 112, 123, 124
decluttering by removing, 123, 124, 
145, 202, 319 

setting apart columns and rows using, 
57 

Boxed data, to focus attention, 166-167 
Brain, seeing with, 148 
Brainstorming, in storyboarding, 3, 286 


back-to-school shopping exercise for, 

20, 22 

CFO's update exercise for, 37-38 
good number of ideas for initial list in, 
20, 24 

overview of process in, 3 
pet adoption pilot exercise for, 24, 26 
practice at work and, 46 
when to perform, 46 
Branding, 214-219, 223-224 

Coca Cola example of, 217-218 
market size over time graph and, 
214-219 

practice on your own exercise, 
223-224 

United Airlines example of, 216-217 
Butler, Jill, 227 

C

Call breakdown over time graph, 242-248
Capitalization. See also All caps; Case 
axis titles and, 198, 202 
logos and, 216 

short word sequences with, 227 
titles and, 227, 241 
when to use, 227 

Car sales over time graph, 200-205 
Case. See also Capitalization 
all caps vs., 133, 198, 228 
titles and, 227, 241 
visual hierarchy and, 227 
Cat food sales graph, 156-163 
Centering 

of labels, 74, 121 
of logos, 216 

of text, 57, 121, 131, 208, 216, 217, 291
of titles, 202 
Cesal, Amy, 229 
Chart Chooser, 104 
Chartmaker Directory, 104 
Charts. See Bar charts 
Checklist, for assessment, 389-391 
Chief Financial Officer (CFO)'s financial 

update exercise, 31-32, 37-38 
Chronological (linear) path 

data event over time and, 296-297 
moving to narrative arc from, 
exercise, 273-274
as narrative path of story, 237
storyboard example using, 332-333 
Circling data, to focus attention, 165-166 
Climax, in narrative arc, 252, 253, 254, 255 
Closure principle, 109, 111 
decluttering and, 124, 139 
description and use of, 112 
heavy line removal and, 124 
Clutter, identifying and eliminating, 
107-146, 388 

alignment and white space exercises 
practice on your own, 140 
practice with Cole, 120-122 
audience needs and, 113, 114, 119 
axis labels and, 319 
clutter as enemy in, 108 
cognitive burden and, 108, 138, 141, 
142, 143, 145 

connection principle and, 118-119, 
126, 139 

contrast, nonstrategic use of, 109 
declutter exercises 

practice on your own, 141-143 
practice with Cole, 123-137 
deleting noncritical data and, 227, 277 
discussion questions at work on, 146 
drawing exercise for, 144 
enclosure principle and, 116-117, 139 
Gestalt and 

overview, 109 

practice on your own, 138-139 
practice with Cole, 111-112 
tying words to graphs, 113-119 
iterations in, 133-134 
lack of visual order and, 108 
market size over time graph and, 
111-112 

monthly voluntary exercise rate 
graph and, 113-119 
overview of (main lessons from SWD 
book), 108-109 

physician prescription patterns graph 
for, 120-122 

practice exercises for, 110-146 
overview, 110 
practice at work, 144-146 
practice on your own, 138-143 


practice with Cole, 111-137 
proximity principle and, 114-115, 

130, 139 

questions to ask yourself and, 145 
reducing redundancy in labels and, 209 
similarity principle and, 115-116, 117, 
119, 139 

storytelling and, 286 
SWD working session on, 397 
time to close deal graph for, 123-134 
tying words to graphs exercise, 113-119 
visual hierarchy and, 202, 227-228 
Coca Cola brand, 217-218 
Cognitive burden 

decluttering to lessen, 108, 138, 141, 
142, 143, 145 

explicit axis titles to lessen, 199 
Color 

accessibility and, 230 
axis lines and, 216, 218 
axis titles and, 176, 198, 216 
branding and, 214, 215, 216, 217, 218 
colorblindness and, 136, 154, 160, 
191, 218, 230, 291, 388
data point highlighting using, 

114-115, 136, 158, 160 
decluttering by removing, 134, 202 
focusing attention using, 135-137, 
154, 155, 158, 159, 160, 

161, 162, 164, 169, 176, 

177, 188

graphs and. See Heatmaps 
grey for background data and, 176 
hue as preattentive attribute and, 

147, 148, 149, 154 
hue changes to focus attention using, 
170 

importance of using sparingly, 155, 
164, 213 

intensity changes using, 161, 169 
iterations in applying, 135-137 
labels and, 127, 130-131 
line graphs with, 66-67 
logos and, 216 

negative and positive associations 
with, 218 

paying attention to detail exercise 
and, 209, 210, 211, 213
practice using with your tool, 188
relative value indicated by intensity 
changes in, 58 

similarity principle and, 112, 115 
in tables 

focusing attention to establish 
hierarchy of information, 57 
relative intensity of color to 
indicate relative value, 58 
time to close deal graph for, 135-137 
titles of graphs and, 132 
transitioning from dashboard to story 
using, 271 

two-sided layout and, 204 
tying words to graphs using, 115, 119 
visual hierarchy and, 202, 227 
white space used in place of, for 
accessibility, 230 

Colorblindness, 136, 154, 160, 191, 218,
230, 291, 388

Compare and contrast, in refining Big Idea,
13 

Comparison 

designing data in graphs for, 210-211 
horizontal bar chart for, 59-60 
single graph for, 55-58 
slopegraph for, 62-63 
two graphs for, 58-59 
vertical bar chart for, 61 
Connection principle, 109, 111 

decluttering and, 118-119, 126, 139 
description and use of, 112 
tying words to graphs and, 118, 119 
Consistency 

case (capitalization) and, 241 
detail in graphs and, 208, 229 
time intervals on graphs and, 88-89 
Content 

organizing exercise for, 36-37 
planning process and, 1 
storyboarding for, 1 
Context, understanding, 1-49, 388 
audience and 

getting to know exercises, 5-6, 
28-29, 41 

identifying action to take, 43 
narrowing exercises, 7-9, 42 
back-to-school shopping exercise for, 
7-9, 10-12, 13-14, 20-23 


Big Idea for, 388 

completing worksheet exercises, 
10-12, 15-17, 31-35, 44 
creating as a team, 46 
critiquing your initial ideas, 18-19 
refining and reframing exercises, 
13-14, 30 

soliciting feedback, 45 
CFO's update exercise, 31-32, 37-38 
content, organizing exercise for, 

36-37 

deciding on right amount of data 
and, 357 

discussion questions at work and, 49 
eliminating data to focus attention 
and, 171 

exploratory vs. explanatory analysis 
and, 2 

importance of, 2 

overview of (main lessons from SWD 
book), 2-3 

pet adoption pilot exercise for, 16-17,
24-27
planning process and, 1 
practice exercises for, 4-49 
overview, 4 

practice at work, 41-49 
practice on your own, 28-40 
practice with Cole, 5-27 
single "So what?" sentence on, 3, 309 
storyboarding and, 388 

arranging potential components, 
36-37 

brainstorming step, 3, 20, 22, 24, 
26, 37-38, 46, 286 
CFO's update exercise, 37-38 
editing step, 3, 20, 24-25, 38, 286 
getting feedback step, 3, 21, 23,
25, 27, 38, 48, 286 
organizing ideas, 47 
overview of process, 3 
revising, 39-40 

soliciting feedback exercise, 48 
sticky notes usage, 3, 20, 46 
three steps, 3 

university elections exercise, 38-40 
storytelling and, 286 
SWD working session on, 397
Continuity principle, 109, 111
decluttering and, 139 
description and use of, 112 
Contrast 

colorblindness and, 230 
decluttering nonstrategic use of, 109 
effectiveness of graphs using, 112 
how to use, 109 

market size over time graph and, 112 
preattentive attributes and, 164 
Contrast and compare approach, in 
refining Big Idea, 13
Critiques. See also Feedback 
Big Idea and, 18-19, 31 
storyboarding and, 38 
Cross-functional teams, 376 
Culture of organization 

feedback and, 385-387 
learning and, 377 

Curvature, as preattentive attribute, 148

D

Dashboards

regular data release using, 268 
reports generated from, 276 
transitioning to story from, 268-271 
Dashed lines, for focusing attention, 168, 
188 

Data 

bias in, 357 

critical role of words in communicating 
about, 195, 197, 199 
deciding on right amount of, 357 
deleting noncritical data, 227, 277 
designing to make sense in graphs, 
209-212 

direct labeling of, 130, 230 
as evidence for making your point, 2 
footnotes for, 226 
myths about, 357 
in storyboarding, 36 
visual hierarchy and, 227 
visualizing, 51. See also Effective 
visuals, choosing 

Data labels 

adding for emphasis, 172-173 
audience focus and alignment of, 121 


bolding of, 160 
centering, 74 

color of, to match data, 130-131 
eliminating if not needed, 128-129, 
145 

eliminating, in decluttering, 128 
focusing attention using, 137, 158, 
160, 172-173
line graph with, 71 
location on bars, 127-128 
proximity principle and, 112, 130 
streamlining redundant information 
in, 145 

weather forecast graph with, 82 
Data markers, 78 

focusing attention and, 172, 173 
Data overview for new HR head exercise, 
5-6 

Data points 

changes over time and, 85 
color for highlighting, 114-115, 136, 
158, 160 

comparing across two graphs, 89 
connection principle and, 112 
critiquing graph for number of, 85 
labeling, 66, 78, 128, 145, 173, 187 
similarity principle and, 112, 117 
tying words to graphs and, 116, 117 
Data stories. See Storytelling 
Data visualization 

audiences and, 67 

benefits of decluttering, 123 

collections of examples of, 104 

common myths in, 356-357 

critical importance of words in, 241 

feedback on, 306 

goal of, 103, 357 

importance of design principles and 
accessibility in, 229 
learning from examples of, 98 
library of examples of, 103 
observing examples around us, 151, 
154, 220 

reviewing and critiquing in team 
meetings, 386, 387 
#SWDchallenge and, 98-99, 386 
Data visualization specialist, 376
Dates

axis labels with, 125, 126, 128, 209-210,
211-212 

consistency in labels with, 208 
time intervals on graphs and, 88-89 
Declutter exercises, 123-134, 141-143 
Decluttering. See Clutter, identifying and 
eliminating 

Demand and capacity by month graph, 
68-75 

Designer, thinking like. See Thinking like a 
designer 
Detail in graphs 

alignment and, 208 
audience assumptions based on, 208 
consistency and, 208 
footnotes for, 226 
making minor changes for major 
impact exercise and, 221-222 

paying attention to detail exercises 
and, 206-213, 228-229 
practice at work exercise and, 228-229 
summarizing, to eliminate distractions, 
227 

thinking like a designer and, 389
Diabetes rates exercise, 330-341 
Diagonal elements 

avoiding, 108, 126, 286 
decluttering and, 126 
how to use, when needed, 118 
Discussion questions at work 

choosing effective visuals and, 105 
decluttering and, 146 
focusing attention and, 189 
storytelling, 283 
SWD process at work, 401 
thinking like a designer and, 233 
understanding context and, 49 
Diversity hiring exercise, 359-360 
Documents 

progression building of graph with 

fully annotated slide as, 257, 
267 

structuring for audience needs, 6 
Dot plots 

attrition rate exercise for, 77 
demand and capacity by month 
exercise for, 74 


practice with Cole on, 74, 77 
when to use, 83 
Dotted lines 

continuity principle and, 112 
focusing attention with, 168, 188 
practice using with your tool, 188 
when to use, 112, 168 
Drawing exercises, 388 
decluttering and, 144 
on paper, 68-70, 91, 144, 286 
practice at work using, 100, 144 
practice on your own using, 91-92 
practice with Cole using, 68-74 
using an online tool, 70-75, 92, 286 
Duarte, Nancy, 3, 10

E

Editing step, in brainstorming, 3, 286

back-to-school shopping exercise for, 20 
CFO's update exercise for, 38 
pet adoption pilot exercise for, 24-25 
questions to ask yourself during, 20, 25 
Effective visuals, choosing, 51-105, 388 
attrition rate exercise for, 76-80 
average time to close a deal exercise 
for, 91-92 

bank index exercise for, 84-86 
choosing a graph exercise for, 94-95 
comparisons using, 55-63 
consistency in time intervals on graphs 
and, 88-89

critiquing exercise for, 84-86 
data visualization library for, 103 
demand and capacity by month 
exercise for, 68-75 
discussion questions at work for, 105 
drawing exercises for, 388 

on paper, 68-70, 91, 100, 286 
using an online tool, 70-75, 92, 
286 

financial savings exercise for, 85-86 
improving graphs in, 87-90, 92-93 
iterations in, 51, 63, 97, 100
learning from examples in, 98 
NPLs and loan loss reserves exercise 
for, 87-90 

overview of (main lessons from SWD 
book), 52-53
practice exercises for, 54-105
overview, 54 

practice at work, 100-105 
practice on your own, 91-99 
practice with Cole, 55-90 
practicing out loud with graphs and, 
102 

resources, 104 

response and completion rates 
exercise for, 96 

simple text vs. graphs in, 52, 77 
single "So what?" sentence on using, 
77 

soliciting feedback and, 102-103 
spotting what's wrong exercise, 96 
starting point in using, 57 
storytelling and, 286 
#SWDchallenge and, 98-99 
SWD working session on, 397 
table improvement approaches, 55-67 
meals served over time exercise, 
64-67 

new client tier share exercise, 
55-63 

table use vs. graphs in, 52 
team goals and, 103 
thinking critically about what we want 
to show on, 83 

trying different representations using, 
63 

types of visuals and, 52-53 
visualizing and iterating exercises for, 
97, 100 

weather forecast exercise for, 81-83 
Enclosure 

as preattentive attribute, 148 
as signal on where to look, 154 
Enclosure principle, 109, 111 

decluttering and, 116-117, 139 
description and use of, 112 
tying words to graphs and, 116-117 
Encounters by type graph exercise, 369-370 
Ending 

leading story with, 237, 256 
narrative arc with, 252, 253, 254, 255 
Errors and reports graph, 366-367 


Examples 

learning from 

choosing effective visuals, 98 
thinking like a designer, 220-221 
library of data visualization, 103 
Excel 

conditional formatting in, 58, 65 
creating graphs in, 71, 74 
heatmap formatting using, 58, 65 
Exercises for practice. See Practice at 

work; Practice on your own; 
Practice with Cole 

Explanations 

action for audience to take in, 43 
conclusion in presentations and, 226 
differentiating between live and 

standalone stories exercise 
and, 257-267

pithy, repeatable phrase exercise 
and, 278

"So what?" question in, 279 
takeaway in, and titles, 241 
transitioning from dashboard to story 
exercise and, 268-271 
tying words to graphs in, 113, 157 
Explanatory analysis, 2, 43, 188, 268 
Exploratory analysis, 2, 188, 268 
Eye movement 

focusing attention and, 287 
in graphs, 156-163 
in pictures, 151-155 
practice on your own, 178-181 
practice with Cole, 151-163 
where your eyes are drawn to, 
149, 151, 287, 388
starting place and 

focusing attention, 149 
in tables, 57 

two-sided layout and, 204 
zigzagging "z" path in information 
processing on graphs and, 57, 
131, 198, 228, 247, 317 

F 

Falling action, in narrative arc, 252, 253, 
254, 255
Feedback
Big Idea 

giving feedback, 18-19 
questions to ask, 45 
soliciting feedback, 35, 45 
cultivating culture of, 385-387 
drawing exercise and, 100 
facial responses in, 103 
giving and receiving exercise, 382-384 
on graphs at work, 102-103 
impromptu, 384 

seeking after storytelling presentation, 
400 

storyboarding, 3, 23, 286 

back-to-school shopping exercise 
for, 21 

CFO's update exercise for, 38 
pet adoption pilot exercise for, 

25, 27 

practice at work, 48 
questions to ask with your partner, 
21, 25, 48

stakeholder and manager opinions, 
48 

team meetings and, 386, 387 
Feedback culture, 385-387 
Financial savings exercises, 85-86 
Findings, in storyboarding, 36 
Fonts 

branding, 214, 215, 216, 217, 223, 228 
size and readability of, 217 
Footnotes 

for acronyms, 307 
for data, 226 

for detailed information, 307 
position of, 216 
spacing with, 121 
Form follows function, 192, 287 
Framing Big Idea 

back-to-school shopping exercise for, 
13-14 

benefits or risks defined during, 13, 
14, 30 

positive and negative framing, 13-14, 30 
practice on your own for, 30, 35 
practice with Cole on, 13-14

G

Gestalt Principles of Visual Perception
description and use of, 112 
overview of, 109, 139 
practice on your own using, 138-139 
practice with Cole on, 111-112 
tying words to graphs and, 113-119, 
229 

Goals 

graphs stating, 132-133 
Objectives and Key Results (OKRs) and, 
380-381 

pithy, repeatable phrase exercise and, 
278 

setting, exercise for, 380-381 
Graphic Continuum, 104 
Graphing applications 

drawing exercises using, 70-75, 92 
using default output exercise with, 
200-205 

Graphs. See specific charts and graphs 
bank branch satisfaction exercise for, 
84-86 

choosing. See Effective visuals, choosing 
comparisons using, 55-63 
consistency in time intervals on, 88-89 
critiquing exercise for, 84-86 
data table embedded in, 93 
deciding on right amount of data for, 
357 

demand and capacity by month 
exercise for, 68-75 

drawing exercises for 

on paper, 68-70, 91, 100, 286 
using an online tool, 70-75, 92, 286 
financial savings exercise for, 85-86 
goal stated in, 132-133 
improving, 87-90, 92-93 
iterations in, 51, 63, 97, 100, 133-134
learning from examples of, 98 
monthly voluntary exercise rate graph 
and, 113-119
myth about, 356 

NPLs and loan loss reserves exercise 
for, 87-90 

overall "feel" of, 229
practice exercises for, 54-105
overview, 54 

practice at work, 100-105 
practice on your own, 91-99 
practice with Cole, 55-90 
practicing out loud with, 102 
simple text vs., in presentations, 52, 77 
soliciting feedback on, 102-103 
spotting what's wrong exercise, 96 
squint test for checking overall 
impression of, 227 
starting point in reading, 57 
summarizing in one sentence, 242-248 
#SWDchallenge for, 98-99 
table use vs., 52 

thinking critically about what we want 
to show on, 83 
titles of. See Titles 

trying different representations using, 
63 

tying words to graphs exercise for, 
113-119 
types of, 52-53 
zero baseline on, 356 
Grey, for background data, 176 
Gridlines, decluttering of, 123, 124-125, 
145, 202, 319 


Heatmaps 

Excel formatting for, 58, 65 
meals served over time exercise for, 
64, 65 

new client tier share exercise for, 58 
practice with Cole on, 58, 64, 65 
relative intensity of color to indicate 
relative value in, 58 
when to use, 52 
Highlighting 

color for, 313 

focusing attention using, 154, 156, 
259, 309 

techniques for, 227 
visual hierarchy and, 227, 233 
Holden, Kritina, 227 
Horizontal bar charts, 176, 204

new client tier share exercise for, 58, 
59-60 


practice with Cole on, 58, 59-60 
space for x-axis labels on, 316-317 
using two charts for comparison,
59-60 
when to use, 53 
Horizontal elements 
alignment of, 108 
diagonal text vs., 126 
How to use exercises, 355 
Hue 

changing to focus attention, 170 
as preattentive attribute, 147, 148, 
149, 154 


Information Is Beautiful Awards, 104 
Intensity 

as preattentive attribute, 148, 164, 188 
practice using with your tool, 188 
relative value indicated by, 58 
Interactive Chart Chooser, 104 
Inversing elements, 227 
Italics 

decluttering and, 131-132 
highlighting using, 227 
visual hierarchy and, 227 

J 

Justification 

of axis titles, 202 

of legends, 247 

of text in graphs, 121, 131 

of text in titles, 198, 202 

of titles, 131, 202, 214, 228, 247

K 

Key results, and goals, 380-381 

L 

Labels. See also Axis labels and titles; 

Data labels 

avoiding diagonal text in, 126 
axis charts with, 82, 127-128, 189 
case and typeface of, 227 
centering of, 74, 121 
consistency in, 208 
color and, 127, 130-131, 158, 160
data points and, 66, 78, 128, 145, 

173, 187 

decluttering and, 125, 126, 127-128, 
130 

direct labeling of data, 130, 230 
end labels on lines, 173 
focusing attention and, 158, 160 
line charts with, 71, 78, 86, 89, 189 
lines on graphs and, 173, 202 
paying attention to detail exercise for, 
208-209 
placement of, 247 
practice using with your tool, 189 
proximity principle and, 112, 130 
reducing redundancy in, 209 
removing data labels, 127-128 
removing trailing zeros in, 125 
Laundry detergent sales graph, 195-199 
Learning culture, 377 
Legends 

for colors, 65 

extra work moving back and forth to 
use, 88, 130, 209, 244, 247 
heatmaps with, 65 
labeling data directly vs., 130, 230 
line graphs eliminating need for, 71 
placement of, 71, 247, 319 
Library, of data visualization examples, 103 
Lidwell, William, 227
Linear (chronological) path 

data event over time and, 296-297 
moving to narrative arc from, exercise, 
273-274 

as narrative path of story, 237 
storyboard example using, 332-333 
Line graphs. See also Slopegraphs 
annotations used on, 67 
attrition rate exercise for, 78-79, 80 
bar charts vs., 129 
connection principle and, 112 
demand and capacity by month 
exercise for, 71-72, 74 
financial savings exercise for, 85-86 
how to use, 356 
labels on, 71, 78, 86, 89, 189
meals served over time exercise for, 
64, 66-67 


myth about, 356 
plotting differences on, 74 
practice with Cole on, 64, 66-67, 
71-72, 74, 78, 80 
shading used on, 74, 79 
slopegraphs as, 52, 356 
when to use, 52, 83, 356 
Lines in graphs 

decluttering by removing, 124 
end markers and labels on, for 
focusing attention, 173 
focusing attention on, 183, 188 
labeling, 173, 202 

length of, as preattentive attribute, 148 
style of, for focusing attention, 168 
width (thickness) of, 148, 167-168, 
188, 217 

Listening, in feedback sessions, 383 
Logos, in branding, 214, 216, 217, 223, 224 
Lowercase. See also Case
titles and, 241

M 

Main message. See also Big Idea 

crafting and refining during planning, 
1, 10, 48 

Managers 

audience insights from, 6 
culture of learning and, 377 
feedback and, 385 
goals in Objectives and Key Results 
(OKRs) and, 380, 381 
storyboard feedback from, 48 
SWD working session and, 398 
Market size over time graph, 111-112, 
214-219 

Marks, as preattentive attribute, 148 
Meals served over time exercise, 64-67 
Medical centers flu vaccination graph, 
92-93, 184 

Memory 

repetition and, 237, 278 
types of, 148 

Message. See Big Idea; Main message 
Model performance exercise, 306-313 
Monthly voluntary exercise rate graph, 
113-119 
Motion

graph example of, 171 
as preattentive attribute, 148 

N 

Narrative arc 

arranging stories along, 252-253, 
254-256, 279-280, 287 
building exercise for, 275 
components of, 281-282 
memorability of stories and, 235 
moving from linear path to, exercise, 
273-274 

stories with, 235, 236, 389 
storyboarding with, 282 
storytelling with, 235, 236, 287, 389 
Narrative flow, 237 

storyboarding to determine, 20, 388 
Narrative structure, 236, 279-280 
Narrowing audience process 

back-to-school shopping exercise, 7-9 
having specific audience in mind, 7 
possible questions to ask, 7-8 
practice at work, 42 
practice on your own, 29 
practice with Cole, 7-9 
primary audiences, 7, 29, 42 
time factors in defining audience, 9 
Negative framing of Big Idea, 13-14, 30, 35 
Net Promoter Score (NPS) exercises, 
239-241, 342-353

New advertiser revenue exercise, 289-293 
New client tier share exercise, 55-63 
NPLs and loan loss reserves exercise, 87-90 

O

Objectives and Key Results (OKRs), 380-381 
100% stacked charts, 53, 245-246, 247 
Oral presentations. See also Slides 
action for audience to take in, 237 
graphs vs. simple text in, 52 
learning from successes and failures 
of, 400 

live and standalone stories exercise 
for, 257-267 

moving from linear path to narrative 
arc exercise and, 273-274 


pithy, repeatable phrase used in, 278 
practicing out loud with graphs in, 102 
progression building of graph on 
slide during, 259-267 
questions about graphs for, 101 
single-slide presentation, 203-204 
soliciting feedback on, 102-103 
time needed for talking about graphs 
in, 101 

Organizing content, 36-37 
Orientation, as preattentive attribute, 148 
Overlapping bar charts, 72-73 

P 

Pet adoption pilot exercise, 16-17, 24-27, 
254-265 

Physician prescription patterns graph, 
120-122 

Pie charts 

choosing a graph exercise with, 94-95 
difficulties in using, 59 
myth about, 356 

new client tier share exercise for, 58-59 
practice with Cole on, 58-59 
when to use, 59, 356 
Planning process 

audience identification during, 1 
Big Idea crafting and refining during, 
1, 10 
storyboarding 

back-to-school shopping exercise, 
20-23 

overview of process, 3 
pet adoption pilot program 
exercise, 24-27 
planning content, 1, 20
three important aspects of, 1 
time spent on, 1, 41, 48
Plot, in narrative arc, 252, 253, 254, 255 
Podcasts, on SWD website, 102, 381 
Point of view, in Big Idea, 10, 18 
Position 

graph example of, 159, 169 
practice using with your tool, 188 
as preattentive attribute, 147, 148, 
149, 164, 188 

Positive framing of Big Idea, 13-14, 30, 35 
Practice at work

choosing effective visuals, 100-105 
data visualization library, 103 
discussion questions, 105 
drawing exercise, 100 
iterating with drawing tool, 100 
practicing out loud, 102 
questions about graphs, 101 
resources, 104 
soliciting feedback, 102-103 
focusing attention, 185-189 

discussion questions at work, 189 
figuring out where to focus, 188 
practicing strategies on your tool, 
186-187 

where your eyes are drawn to, 185 
identifying and eliminating clutter, 
144-146 

discussion questions at work, 146 
drawing exercise, 144 
questions to ask yourself, 145 
list of exercises, 379 
more practice exercises, 375-401 
assessment rubric, 389-391 
conducting SWD working session, 
396-398 

creating plan of attack, 379 
cultivating a feedback culture, 
385-387 

discussion questions, 401 
facilitating Big Idea practice 
session, 391-395 
giving and receiving effective 
feedback, 382-384 
how to use exercises, 375 
revisiting SWD process, 388-389 
setting good goals, 380-381 
tips for crafting successful data 
stories, 399-400 
storytelling, 278-283 

discussion questions at work, 283
employing narrative arc, 281-282 
finding exercise, 279-280 
pithy, repeatable phrase, 278 
thinking like a designer, 225-233 

accessibility design notes, 229-230 
creating visual hierarchy, 227-228 
data accessibility with words, 
225-226 


discussion questions at work, 233
garnering acceptance for designs, 
231-232 

paying attention to detail, 228-229 
understanding context, 41-49 
audience, 41-43 
Big Idea, 44-46 
storyboarding, 47-48 
Practice on your own 

choosing effective visuals, 91-99 
choosing a graph, 94-95 
drawing exercises, 91-92 
improving a graph, 92-93 
learning from examples, 98 
spotting what's wrong, 96 
#SWDchallenge, 98-99 
visualizing and iterating, 97 
focusing attention, 178-184 
focusing on bars, 184 
focusing within tabular data, 

182-183 

multiple ways of directing 
attention, 183 

where your eyes are drawn to, 
178-181 

identifying and eliminating clutter, 
138-143 

alignment and white space, 140 
declutter exercises, 141-143 
Gestalt principles, 138-139 
more practice exercises, 355-373 
accounts over time, 365
adverse effects, 362 
diversity hiring, 359-360 
encounters by type, 369-370 
errors and reports, 366-367 
how to use exercises, 355 
reasons for leaving, 363-364 
revenue forecast, 361 
sales by region, 360 
store traffic, 371-373 
taste test data, 368-369 
storytelling, 272-277 

building a narrative arc, 275 
evolving from report to story, 
276-277 

identifying tension, 272-273 
moving from linear path to 
narrative arc, 273-274
thinking like a designer, 220-224
branding, 223-224 
learning from examples, 220-221 
making minor changes for major 
impact, 221-222 
ways of improving a graph, 
222-223 

understanding context, 28-40 
audience, 28-29 
Big Idea, 30-35 
storyboarding, 36-40 
website for downloading data and 
graphs for, 91 
Practice with Cole 

choosing effective visuals, 55-90 
critiquing a graph, 84-86 
drawing exercise on paper, 68-70 
drawing exercise using an online 
tool, 70-75 

improving a graph, 87-90 
table improvement, 55-67 
ways of showing attrition rate 
data, 76-80 

weather bar charts, 81-83 
focusing attention, 151-177 
multiple ways of directing 
attention, 164-174 
visualizing all data, 175-177 
where your eyes are drawn to in 
graphs, 156-163 
where your eyes are drawn to in 
pictures, 151-155 
identifying and eliminating clutter, 
111-137 

alignment and white space, 
120-122 

declutter exercise, 123-134 
Gestalt principles, 111-112 
tying words to graphs, 113-119 
more practice exercises, 285-353 
back-to-school shopping, 314-329
diabetes rates, 330-341 
how to use exercises, 285 
model performance, 306-313 
net promoter score, 342-353 
new advertiser revenue, 289-293 
sales channel update, 294-305 
storytelling, 239-271 


arranging along narrative arc, 
254-256 

differentiating between live and 
standalone stories, 257-267 

identifying tension, 249-251 
putting graph into one sentence, 
242-248 

takeaway titles, 239-241 
transitioning from dashboard to 
story, 268-271 
using components of story, 
252-253 

thinking like a designer, 195-219 
applying branding, 214-219 
paying attention to detail and 
design choices, 206-213 
using default output from tools, 
200-205 

using words wisely on graphs, 
195-199 

understanding context, 5-27 
audience, 5-9 
Big Idea, 10-19 
storyboarding, 20-27 
Preattentive attributes. See also specific 
attributes 

designing for accessibility and, 226 
focusing attention using, 147, 148, 
164, 173-174

Prescription patterns graph, 120-122 
Presentations. See Oral presentations; Slides 
Primary audiences, 7, 29, 41 
Problem statement, in storyboarding, 36 
Proximity principle, 109, 111 

decluttering and, 114-115, 130, 139 
description and use of, 112 
direct labeling of data and, 130 
tying words to graphs and, 114 


Questions, in feedback sessions, 383 

R 

Reasons for leaving graph, 363-364 
Recommendations, in storyboarding, 36 
Reddit: Data Is Beautiful, 104 
Refining and reframing Big Idea

back-to-school shopping exercise for, 
13-14 

benefits or risks defined during, 13, 

14, 30 

compare and contrast in, 13 
planning process and, 1, 10
positive and negative framing 

exercises for, 13-14, 30, 35 
practice on your own for, 30 
practice with Cole on, 13-14 
Repetition 

forming pithy, repeatable phrase 
exercise and, 278, 287 
memory and, 237, 278 
Reports, 237 

evolving to story from, exercise, 276-277 
learning from successes and failures 
of, 400 

title consistency with, 199 
Resonate (Duarte), 3, 10 
Resources, for choosing effective graphs, 
104 

Response and completion rates exercise, 
96 

Revenue forecast exercise, 361 
Revising, during storyboarding, 39-40 
R Graph Gallery, 104 
Ricks, Elizabeth Hardman, 330 
Rising action, in narrative arc, 252, 253, 
254, 255 

Risks or benefits definition, in refining Big 
idea, 13, 14, 30 

S

Saint-Exupery, Antoine de, 192 
Sales by region exercise, 360 
Sales channel update exercise, 294-305 
Scatterplots 

back-to-school shopping example of, 
315 

when to use, 52, 83 
Shading 

enclosure principle and, 112, 116 

line graphs with, 74, 78 

tables with, 57, 65 

time intervals shown by, 89 

tying words to graphs using, 116-117 


Shape, as preattentive attribute, 148 
Similarity principle, 109, 111 

color in focusing attention and, 159 
decluttering and, 115-116, 117, 119, 
139 

description and use of, 112 
tying words to graphs and, 115-116, 
117, 119

Size 

as preattentive attribute, 147, 148, 
149, 164 

as signal on where to look, 154 
visual hierarchy and, 227 
Sketching. See Drawing exercises 
Slideument, 257 
Slides 

alignment and white space on, 121 
audience's path on, 204 
practicing out loud with, 102 
progression building of graph on, 
259-267 

single-slide presentation, 203-204, 213 
takeaway titles for, 239-240, 241 
title guidelines for, 199 
tying words to graphs on, 113-119, 229 
where your eyes are drawn to, 151, 
178, 185

Slopegraphs, 356 

choosing a graph exercise with, 

94-95 

comparison using, 62-63 
as line graphs, 52, 356 
practice with Cole on, 62-63 
when to use, 52, 83 
Solutions to exercises for practice. See 

Practice at work; Practice on 
your own; Practice with Cole 
"So what?" sentence, 309 

critiquing graphs using, 85 
storytelling and, 279 
understanding context and, 3 
using effective visuals and, 77 
Spatial position 

graph example of, 159, 169 
practice using with your tool, 188 
as preattentive attribute, 147, 148, 
149, 164, 188 
Square area charts, 53 
Squint test, 227
Stacked charts 

choosing a graph exercise with, 94-95 
demand and capacity by month 
exercise for, 69, 73, 74 
horizontal, 58, 60, 68, 69, 176, 204, 
316-317 

100% stacked, 53, 245-246, 247 
practice with Cole on, 69, 73, 74 
vertical, 53, 61, 316
when to use, 53 
zero baseline on, 53 
Starting place 

focusing attention and, 149 
key takeaway of graph in title and, 
226, 230 

legend placement and, 247 
tables and, 57 

testing where your eyes are drawn to 
and, 149 

titles of graphs and, 198, 202, 226, 
230, 239 

two-sided layout and, 204 
zigzagging "z" path in information 
processing on graphs and, 
57, 131, 198, 228, 247, 317 

Sticky notes 

building narrative arc using, 275 
moving from linear path to narrative 
arc exercise using, 274 

storyboarding using, 3, 20, 46 
storytelling using, 252, 253 
Store traffic exercise, 371-373 
Story. See also Storytelling 

answering "So what?" question to 
find, 279 

with capital "S," 279, 280 
components of, 235 
memorability of, 235 
narrative arc for structuring, 279-280 
with lowercase "s," 279, 280 
Storyboarding, 20-27, 388 

arranging potential components of, 
36-37 

back-to-school shopping exercise for, 
20-23 

CFO's update exercise for, 37-38 


critiquing, 39 
definition of, 20 
discard pile in, 47 
narrative arc with, 282 
organizing ideas in, 47 
overview of process in, 3, 286 
pet adoption pilot program exercise 
for, 24-27 

practice at work for, 47-48 
practice on your own for, 36-40 
practice with Cole on, 20-27 
planning content using, 1, 3
brainstorming step in, 3, 20, 22, 24, 
26, 37-38, 46, 286 
getting feedback step in, 3, 21, 23,

25, 27, 38, 48, 286 
editing step in, 3, 20, 24-25, 38, 286 
revising, 39-40 

stakeholder and manager feedback 
on, 48 

sticky notes used in, 3, 20, 46 
storytelling and, 287 
three steps of, 3 

university elections exercise for, 38-40 
Storytelling, 235-283, 389 

call breakdown over time graph and,
242-248 

evolving from report to story exercise, 
276-277 

examples of approaches to, 285 
feedback after presenting, 400 
key elements of, 236, 252 
live and standalone stories exercise, 
257-267 

narrative arc in, 236, 287, 389 
arranging stories along arc, 

252-253, 254-256, 279-280 
building exercise, 275 
components, 281-282 
memorability of stories, 235 
moving from linear path to 
narrative arc exercise, 273-274 
narrative flow in, 237 
narrative structure of, 236, 279-280 
Net Promoter Score (NPS) over time 
graph and, 239-241 
overview of (main lessons from SWD 
book), 236-237 
pet adoption pilot program exercise
and, 254-265

potency of stories and, 235 
practice exercises for, 238-283 
overview, 238 
practice at work, 278-283 
practice on your own, 272-277 
practice with Cole, 239-271 
putting graph into one sentence 
exercise and, 242-248 
repetition in, 237 

spoken vs. written narrative in, 237 
steps in, 286-287 
#SWDchallenge for, 98-99 
SWD working session on, 397 
takeaway titles exercise for, 239-241 
tension and, 236, 279 

action to be taken, 249, 251, 272
identifying exercises, 249-251, 
272-273 

memorability of stories, 235 
story shape, 252, 253, 255 
time to fill open roles graph and, 
257-267 

tips for crafting successful data stories 
in, 399-400 

transitioning from dashboard to story 
exercise and, 268-271 
using components of story exercise 
and, 252-253

Storytelling with Data (SWD) book, review 
of main lessons 
context, 2-3 
effective visuals, 52-53 
focusing attention, 148-149 
identifying and eliminating clutter, 
108-109 

storytelling, 236-237, 286-287 
thinking like a designer, 192-193 
Storytelling with Data website 

(storytellingwithdata.com) 
assessment rubric on, 391 
Big Idea worksheet on, 44 
blog on, 98, 229 

downloading data and graphs for 
exercises from, 91 
podcasts available on, 102, 381 
resources on, 405 


#SWDchallenge on, 98-99, 386 
SWD process on, 389
SWD blog, 98, 229 
#SWDchallenge, 98-99, 386-387 
SWD process 

assessment rubric for, 389-391 
conducting SWD working session in, 
396-398 

discussion questions on, 401 
downloadable version of, 389 
resources for, 405 
summary of steps of, 388-389 
using as final assessment of project, 391 

T 

Tableau Public Gallery, 104 
Tables 

appropriate level of detail in, 56 
embedded bars used in, 58 
embedding in graphs, 93 
focusing attention to establish 

hierarchy of information in, 57 
focusing on contents in, 182-183 
graphs vs., in live presentations, 52 
heatmaps as, 52 
shading vs. white space in, 57 
starting point in using, 57 
table improvement approaches 

meals served over time exercise,
64-67 

new client tier share exercise, 
55-63 

steps in analyzing a table, 55 
trying different representations using, 
63 

when to use, 52 
Takeaway titles 

accessibility and, 230 
color for, 158 

primary point made in, 241 
priming audience using, 170 
storytelling and, 239-241 
thinking like a designer and, 199, 

226, 389 

transitioning from dashboard to story 
using, 271 
Taste test data exercise, 368-369
Team, creating Big Idea working as, 46 
Teams 

cross-functional, 376 
developing, 376 

effective data visualization goal of, 193 
embracing constraints by, 377 
feedback time in meetings of, 385, 386 
Objectives and Key Results (OKRs) 
and, 380-381 

#SWDchallenge and, 386-387 
Tension 

action to be taken and, 249, 251, 272
identifying exercises and, 249-251, 
272-273 

memorability of stories and, 235 
narrative arc with, 279 
story shape with, 252, 253, 255 
Text boxes 

alignment of, 121, 216
closure principle and, 112 
position of, 216 
white space around, 121 
Text in graphs. See also Words in graphs 
alignment of, 120, 121, 122 
branding and size of, 217, 224 
centering of, 57, 121, 131, 208, 216,
217, 291 

different audience perspectives 
resulting in different interpretations of, 197 
graphs vs. simple text, in 
presentations, 52 

justification of, 121, 131 
simple text vs. visuals, 52, 77 
two-sided layout using, 204 
white space with, 120, 121-122 
Thinking like a designer, 191-233, 388-389 
acceptance by audience of designs 
and, 191, 193, 220, 231-232 
accessibility and, 191, 193, 220, 
229-230 

aesthetics and, 191, 193, 220 
affordances and, 191, 192, 220 
axis titles and, 198, 202 
branding exercises for, 214-219, 
223-224 

car sales over time graph and, 
200-205 


critical role of words in communicating 
data and, 195, 197, 199 
data accessibility with words exercise 
and, 225-226

designing data to make sense in, 
209-212 

different audience perspectives 
resulting in different 
interpretations of text and, 
197 

discussion questions at work and, 233 
form follows function in, 192, 287 
laundry detergent sales graph and, 
195-199 

learning from examples in, 220-221 
making minor changes for major 

impact exercise and, 221-222 
market size over time graph and, 
214-219 

overall design critique in, 201, 202,
205, 229 

overall "feel" of, 229 
overview of (main lessons from SWD 
book), 192-193 
paying attention to detail and, 
206-213, 228-229, 389 
practice exercises for, 194-233 
overview, 194 
practice at work, 225-233 
practice on your own, 220-224 
practice with Cole, 195-219 
storytelling and, 287 
SWD working session on, 397 
takeaway titles and, 199, 226, 388 
touchpoints per customer over time 
graph and, 206-213 
two-sided layout in, 204 
using default output from tools 
exercise, 200-205 
using words wisely on graphs 
exercise, 195-199 
visual hierarchy and, 201, 202,
227-228, 389 

ways of improving a graph exercise 
and, 222-223
word choice and, 200, 202 
3-minute story, 3 
Time

planning and need for, 1, 41, 48
time intervals on graphs, 88-89 
Time factors 

defining audience using, 9 
feedback session scheduling and, 382 
footnotes on, 226 
talking about graphs and, 101 
Time to close deal graph, 123-137 

decluttering exercise with, 123-134 
focusing attention and, 135-137 
Time to fill open roles graph, 257-267 
Titles. See also Takeaway titles 

of axis. See Axis labels and titles 
of graphs 

branding, 217, 218, 224 
centering, 202 
color, 132 

decluttering, 131, 132 
guidelines for, 199, 225 
orientation, 131 
placement of, 131, 202, 214,

228, 247 

starting place in graphs, 198, 

202, 226, 230, 239 
takeaway titles, 226, 230, 
239-241, 388
two-sided layout and, 204 
of slides and oral presentations 
pithy, repeatable phrase, 278 
word choice, 239, 241 
Title text in graphs 

alignment of, 198, 202 

case (capitalization) of, 227, 241 

color of, 160 

focusing attention using, 158, 160, 
170, 202 

key takeaway of graph placed in, 226, 
230, 239-241, 388
typeface of, 227 

wording changes for clarity of, 197-198 
Tools 

default output exercise with, 200-205 
drawing exercises using, 70-75, 92 
Touchpoints per customer over time graph, 
206-213 

Two-sided graph layout 

detail and design choices exercise for, 
206-213 


revising content to create, 204 
transitioning from dashboard to story 
using, 271 

Typeface 

as signal on where to look, 154 
visual hierarchy and, 227 

U

Underlining, for highlighting, 227 

United Airlines, 216-217 

Universal Principles of Design (Lidwell, 

Holden, and Butler), 227 

University elections exercise, 33-35, 38-40 

Uppercase. See All caps; Capitalization; Case 

V 

Vaccine rate exercise, 17-18 
Vertical bar charts 

comparison using, 61 
horizontal bar chart vs., 316 
practice with Cole on, 61 
when to use, 53 

Vertical elements, alignment of, 108 
Visual hierarchy 

focusing attention to establish, 57 
practice at work exercise and, 227-228 
thinking like a designer and, 201, 

202, 227-228, 389 
tips on, 227-228 

Visualizing data, 51. See also Effective 

visuals, choosing 

Visuals 

choosing. See Effective visuals, 
choosing 

simple text vs., 52, 77 
types of, 52-53 

W

Waffle charts, 53 
Waterfall charts, 53 
Weather forecast exercise, 81-83 
Web Content Accessibility Guidelines, 230 
"Where are your eyes drawn?" test, 388 
combining preattentive attributes 
and, 173-174
graphs and, 156-163 
how to perform the test, 149, 151
pictures and, 151-155 
practice at work and, 185 
practice on your own and, 178-181 
practice with Cole and, 151-163 
tables and, 57 
White space 

accessibility and, 230 
bar graphs with, 71 
effectiveness of graphs and, 112 
decluttering and, 120-122, 140, 286 
market size over time graph and, 112 
paying attention to detail with, 228 
physician prescription patterns graph 
for, 120-122 

practice on your own and, 140 
practice with Cole and, 120-122 
two-sided layout using, 204 
visual order using, 108 
when to use, 230 

Words in graphs. See also Text in graphs 
axis label clarity and, 197, 198, 320 
branding and, 217 
critical role of, 195, 197, 199, 241 
data accessibility with words exercise, 
225-226 

different audience perspectives 
resulting in different 
interpretations of, 197 
key takeaway and, 226 
laundry detergent sales graph and, 
195-199 

memorability of stories and, 235 
monthly voluntary exercise rate 
graph and, 113-119 
paying attention to detail exercise
and, 213 

slide titles and, 239, 241 
takeaway titles exercise and 
choosing, 239-241 

transitioning from dashboard to story 
using strategic choice of, 271, 277

tying words to graphs and, 113-119, 
229 

using words wisely exercise and, 
195-199 


X 

x-axis 

connection principle and, 112 
eliminating diagonal text on, 126 
labels for. See Axis labels and titles 
practicing out loud with graphs and, 
102 

time intervals on, 88-89 
Xenographies, 104 

Y 

y-axis 

connection principle and, 112 
dual y-axes, 88 

labels for. See Axis labels and titles 
practicing out loud with graphs and, 
102 

secondary, 308 
starting at zero, 82 

Z

zigzagging "z" path in information 

processing on graphs, 57, 
131, 198, 228, 247, 317 
Praise for Resonate


“Nancy Duarte boils down great presentations to the 
essence of what connects people—great stories. As a 
leader, you need to connect, convince, and change 
people with your words—so don’t you dare start your 
next presentation without consulting this book first!” 

Charlene Li 
Author, Open Leadership 
Founder, Altimeter Group 


“Storytelling, empathy, and creativity are fundamental 
to the way we communicate, learn, and grow. Resonate 
teaches us how to access and master these gifts in 
meaningful and productive ways.” 


Biz Stone 
Twitter Co-Founder 


“Resonate takes you on a beautiful journey illustrating 
how to construct and deliver the kind of presentations 
that are truly remarkable, memorable...and may even 
change the world. Anyone with ambition to make a 
difference in this world needs to get this important book. 
Nancy has made another remarkable contribution!” 

Garr Reynolds 

Author of Presentation Zen and The Naked Presenter 


“TED knows first-hand how ideas that spread change the 
world. If you read Resonate, you’ll learn how to present 
ideas that stand out, are repeated, and create change.” 

Tom Rielly 

Community Director, TED Conferences 


“Nancy knows a secret, and she’s not shy about 
sharing it: If you are intentional about your 
presentations, if you tell a story on purpose, if 
you set out to cause the change you say you want, 
you’ll succeed. This book goes a long way in selling 
you on making that choice.” 


Seth Godin 
Speaker, Blogger, and Author 


“There is a stark difference between facts and a 
story, between an image and a design, between 
conveying information and moving people. These 
differences distinguish people who yell but aren’t 
heard from those whose whispers resonate loudly 
and clearly. This is a gorgeous book. Powerful ideas, 
visually delectable, and with life-changing insight.” 

Jennifer Aaker 
General Atlantic Professor of Marketing 
at Stanford Graduate School of Business 
and Co-Author of The Dragonfly Effect 


“At the heart of leadership and learning is great 
storytelling. Resonate will both inspire and give 
you the tools to teach, motivate, and encourage 
audiences not just to listen but to change and to 
act...and the world needs a lot more of that! This 
book is a keeper, one to be read and reread by 
anyone in the business of persuasion.” 

Jacqueline Novogratz 
CEO of Acumen Fund 
and Author of The Blue Sweater 


“She’s done it again! Far more than an ‘encore performance,’
Resonate has everything—‘the beginning,
middle, and end’—to make your pitch sing. 

Bravo, Nancy!” 


Raymond Nasr 
Advisor, Twitter, Inc. 


“As Nancy Duarte knows better than anyone, it’s not 
about the slides. This smart, insightful book will teach 
you how to use the power of story to recast your 
thinking and reinvigorate your presentations. Resonate 
is a must-read for anyone who has to stand before an 
audience and persuade.” 


Daniel H. Pink 
Author of DRIVE and A Whole New Mind 


“Duarte’s approach takes the reader ‘through’ something—
transformation is the key—being purposeful
about how and where you take your audience is 
something that most presenters fail to even consider, 
let alone construct.” 


Dan’l Lewin 

Corporate Vice President, Microsoft Corporation 


“Nancy is one of the great storytellers of our time. 
Thanks for letting us take a look behind the curtain 
and learn from your giftedness!” 


Mark Miller 

Vice President, Training and Development, Chick-fil-A 


“The next time you need to tell a story, be sure to 
have a copy of Nancy’s book as your trusty guide. 
Through the ups and downs, the tears and the 
applause, Nancy will help you find your way through 
the often mysterious, fascinating, and powerful 
world of storytelling.” 


Cliff Atkinson 

Author of Beyond Bullet Points and The Backchannel 


“The problem with most presentations is that speakers 
don’t have a compelling story to tell before they open 
PowerPoint. Resonate solves the problem. It’s another 
magnificent book by the world’s top presentation 
designer, Nancy Duarte. It will hold a permanent 
place next to Duarte’s Slide:ology on my bookshelf!” 

Carmine Gallo 
Communication Skills Coach 
and Author of The Presentation Secrets of Steve Jobs 


“Resonate shows you how to evolve information into 
a story that connects with your audiences and rallies 
them to action. This powerful and groundbreaking 
work is essential reading for executives, entrepreneurs, 
students, teachers, civil servants—everyone with ideas 
and the desire to move them forward.” 


Karen Tucker 
CEO, Churchill Club 
I miss you Daddy.


“The mystery lies in the use of language to express human life.” 

Eudora Welty 

Foreword


Great presentations are like magic. They amaze their 
audiences. And great presenters are like magicians. In 
addition to practicing regularly, both are reluctant to 
reveal the methods behind their performances. Among 
magicians, it is acceptable to reveal secrets to those 
who are committed to learning the art and becoming 
serious magicians. In that same context, Nancy Duarte 
is offering a one-of-a-kind learning opportunity, available
to those who take presentations seriously.

Resonate cracks the code on how 
to orchestrate the invisible attributes 
that shape transformative audience 
experiences. 

It all starts with becoming a better storyteller. Possessing 
the power to influence the beliefs of others and create 
acceptance of new ideas is timeless. The value of storytelling
transcends language and culture. As we move rapidly
toward a future of improved connections between
people, cross-pollinated creativity, and digital effects, 
stories still represent the most compelling platform we 
have for managing our imaginations—and our infinite 
data. More than any other form of communication, the 
art of telling stories is an integral part of the human 
experience. Those who master it are often afforded 
great influence and enduring legacy. 


Nancy Duarte understands how to align ideas to create 
world-shaping responses. No one else has been so exclusively
focused on mastering the presentation space as a
discipline, and few have worked across a broader spectrum
of client profiles and communication challenges.
She is passionate about building systems that make 
creative results replicable and scalable. 

Nancy Duarte has an insatiable curiosity 
about processes—with a relentless drive 
to codify practices that have defied specification
by others.

Over two decades and dozens of economic cycles, she 
has attracted extraordinary talent to her organization, 
and with a singular vision has established it as the industry
leader. In fact, Duarte Design now lays claim to working
with half of the world's top fifty brands, and many of
its most innovative thinkers. The analysis and insights in 
this book are distinctly competent. 

Knowing a magic secret doesn’t make you a magician. 
You have to do more than just read instructions. Without 
exception, great presenters are very deliberate about 
learning how to refine and reveal their ideas. They hone 
their words, sweat the structure, and practice their craft 
rigorously. They constantly seek and adapt to feedback. 


If great presentations were easy to build 
and deliver, they wouldn’t be such an 
extraordinary form of communication. 
Resonate is intended for people with 
ambition, purpose, and an uncommon 
work ethic. 

Applied with passion and purpose, the concepts in this 
book will accelerate your career trajectory or propel your 
social cause. At Duarte Design, we see it happen every 
day. Few pursuits in professional self-improvement have 
as much potential leverage. All you really need is an idea. 
Most of the influential presenters throughout history—
including those profiled in this book—started with one
really good idea. You may be incubating that class of 
idea, or you may be one slide deck away from it; either 
way, you need to get it out in the open so that we can all 
benefit from it. 

Nancy Duarte wants you to become a thought leader. 
She hopes you can give the rest of us the structure and 
direction we need to navigate challenges and opportunities,
and help us interpret our goals. She expects you to
make sense of chaos. She dares you to be transparent 
and evocative, motivational and persuasive. Above all, 
she trusts you to inspire action for our greater good. 

Dan Post 

President and Principal, Duarte Design 

Introduction


Language and power are inextricably linked. The
spoken word pushes an idea out of someone's head and
into the open so humankind can contend with adopting
or rejecting its validity. Moving an idea from its
inception to adoption is hard, but it’s a battle that 
can be won simply by wielding a great presentation. 

Presentations are a powerfully persuasive tool, and 
when packaged in a story framework, your ideas 
become downright unstoppable. Story structures 
have been employed for hundreds of generations 
to persuade and delight every known culture. 

Two years ago, I set out to uncover how story applies 
to presentations. There seemed to be a storylike magic 
to the presentations that caused change and spread 
broadly. Since I already had the context of thousands 
of presentations my firm had created for smart 
companies and causes, I studied what I didn’t know: 
screenwriting, literature, mythology, and philosophy—
allowing myself to be led on a fascinating journey.


Early in my research, I stumbled on this graphic made 
in 1863 by German dramatist Gustav Freytag that he 
used to visualize the five-act structure popular in Greek 
and Shakespearean dramas. It shows the “shape” of a 
dramatic story. The drama builds toward a climax and 
then resolves. 


[Figure: Freytag’s pyramid, labeled EXPOSITION, CLIMAX, and DENOUEMENT]




When I saw Freytag’s pyramid, I knew that powerful 
presentations must also have a contour. I just didn’t 
quite know what the shape looked like yet. I also knew 
that presentations are different from dramatic stories 
because in a presentation, it’s rare to have a lone protagonist
whose story builds toward a single climactic moment.
Presentations have more layers and have disparate pieces 
of information to convey. Dramatic stories have a single 
climax as the crowning event whereas great presentations 
move along with multiple peaks that propel them forward. 

I’ll never forget the Saturday morning when I finally 
sketched out a shape. I knew that if it was accurate, I 
should be able to overlay it onto two very different yet 
game-changing presentations. So I painstakingly analyzed
Steve Jobs’s 2007 iPhone launch and Martin Luther
King Jr.’s “I Have a Dream” speech. Both mapped to the 
form I had sketched. I cried. Literally. It felt like such a 
mystery had been revealed. 

There’s something sacred about stories. They have an 
almost supernatural power that should be wielded 
wisely. Religious scholars, psychologists, and mythologists
have studied stories for decades to determine the
secret to their power. 


It’s still the dawn of the information age, and we are all overwhelmed
with too many messages bombarding us and trying
to lure us to acquire and consume information (then repeat the 
process over and over). We are in a more selfish and cynical 
age, which makes it tempting to be detached. Technology has 
given us many ways to communicate, but only one is truly human: 
in-person presentations. Genuine connections create change. 

You’ll notice that change is a theme throughout the book. Most 
presentations are delivered to persuade people to change. All 
presentations have a component of persuasion to them. This 
notion may ruffle some feathers. But isn’t there usually a desired 
outcome from what’s classified as an informative presentation? 
Yes. You’re moving your audience from being uninformed to 
being informed. From being uninterested in your subject to being 
interested. From being stuck in a process to being unstuck. Many 
times the audience needs to do something with the information 
you’re conveying, which makes your presentation persuasive. 

So whether you’re an engineer, teacher, scientist, executive, 
manager, politician, or student, presentations will play a role in 
shaping your future. The future isn’t just a place you’ll go; it’s 
a place you will invent. Your ability to shape your future depends 
on how well you communicate where you want to be when you 
get there. 


How to Use Resonate 

Resonate is a prequel to my first book, Slide:ology. When
I wrote Slide:ology, I thought the most pressing need in
communications was for people to learn how to visually 
display their brilliant ideas so they were clearer and less 
overwhelming for the audience to process. Come to 
find out, there was a much deeper problem. Gussying 
up slides that have meaningless content is like putting 
lipstick on a pig. 

Presentations are broken systemically, and the methodology
in Resonate uses story frameworks to create
presentations that will engage, transform, and activate 
audiences. After more than twenty years of developing 
presentations for the best brands and thought leaders 
in the world, we’ve codified our Visual Story™ methodology
so you can change your world!

Below are design elements to be aware of:

• The green www symbol signifies there’s additional 
material at www.duarte.com about the subject. 

• The Presentation Form™ is used as an analysis tool 
throughout the book and is visually expressed as 
a sparkline (a term developed by Edward Tufte). 

• The bold text is for the reader who wants to just 
skim and get a nugget from each spread. 

• The blue body text signifies my personal stories 
or excerpts from speeches. 

• There are citations from several sources in the body 
copy, but some deserved extra emphasis and are 
pulled out in orange text. 

This book is simultaneously an explanation, a how-to guide, 
and a business justification for story-based messaging. It 
will take you on a journey to a level of presentation literacy 
that very few have mastered. Using techniques from story 
and cinema, you will understand key steps for connecting 
to the audience, deferring to them as the hero, and creating 
a presentation that resonates. 




Invest Your Time 

Be forewarned: A high-quality in-person presentation 
takes time and planning, yet pressure on our time prevents
us from preparing high-quality communications.

It takes discipline to be a great communicator—it’s a 
skill that will bring a big payoff to you personally and 
to your organization. 

But a recent survey conducted by Distinction had some 
startling findings. Of the executives surveyed, over 86 
percent said that communicating clearly impacts their 
careers and incomes yet only 25 percent put more than 
two hours into preparing for very high-stakes presentations.
That’s a big gap.

The result of investing in an important presentation 
is unparalleled in any other medium. When an idea is 
communicated effectively, people follow and change. 
Words that are carefully framed and spoken are the 
most powerful means of communication there is. The 
lifework of the communicators featured in this book
is proof.


Hope you enjoy, 


DISTINCTION COMMUNICATION EXECUTIVE SURVEY RESULTS

1. How would you rank the importance of personal presentation skills in what you do?

   86.1%  Communicating with clarity directly impacts my career and income.
   13.8%  I present from time to time but the stakes don’t seem all that high.
   0%     I don’t do any formal presentations.

2. What do you find the most challenging part of creating a presentation?

   35.7%  Putting together a good message.
   8.9%   Creating quality slides.
   13.8%  Delivering the presentation with confident skills.
   41.1%  All of the above!

3. How much time do you spend practicing for a “high-stakes” presentation?

   12.1%  I seldom have time to practice at all.
   16.2%  5-30 minutes.
   17.0%  30 minutes to one hour.
   29.2%  One to two hours.
   25.2%  More than two hours!

© www.distinction-services.com

Persuasion Is Powerful


Movements are started, products are purchased, philosophies are adopted, 
subject matter is mastered—all with the help of presentations. 

Great presenters transform audiences. Truly great communicators make it 
look easy as they lure audiences to adopt their ideas and take action. This 
isn’t something that just happens automatically; it comes at the price of 
long and thoughtful hours spent constructing messages that resonate 
deeply and elicit empathy. 

Throughout the book, you’ll learn from some of the greatest communicators. 
Each is different and yields a unique insight, yet they share a common thread: 
They all create a groundswell of support for their ideas. These communicators
don’t have to force or command their audiences to adopt their ideas.
Instead, the audience responds willingly with a surge of support. 


Great Communicators



MOTIVATOR 
Benjamin Zander, 
Conductor, Boston 
Philharmonic Orchestra 


MARKETER 
Beth Comstock, 
Chief Marketing 
Officer, GE 


POLITICIAN 
Ronald Reagan, 
Former President of 
the United States 


CONDUCTOR 
Leonard Bernstein, 
Conductor, New York 
Philharmonic Orchestra 


LECTURER 
Richard Feynman, 
Professor, California 
Institute of Technology 


PREACHER 
John Ortberg, 
Pastor, Menlo Park 
Presbyterian Church 


EXECUTIVE 
Steve Jobs,
Chief Executive
Officer, Apple Inc.



ACTIVIST
Martin Luther King Jr.,
Civil Rights Activist


ARTIST
Martha Graham,
Contemporary Dancer

Resonance Causes Change


Presentations are most commonly delivered to persuade
an audience to change their minds or behavior.
Presenting ideas can either evoke puzzled stares or 
frenzied enthusiasm, which is determined by how well 
the message is delivered and how well it resonates 
with the audience. After a successful presentation, 
you might hear people say, “Wow, what she said really 
resonated with me.” 

But what does it mean to truly resonate with someone? 

Let’s look at a simple phenomenon in physics. If you 
know an object’s natural rate of vibration, you can 
make it vibrate without touching it. Resonance occurs
when an object responds strongly to an external stimulus
that matches its natural frequency of vibration. To the
right is a beautiful visualization of resonance. My son 
poured salt onto a metal plate that he then hooked 
up to an amplifier so that the sound waves traveled 
through the plate. As the frequency was raised, the 
sound waves tightened and the grains of salt jiggled, 
popped, and then moved to a new place, organizing 
themselves into beautiful patterns as though they knew 
where they “belonged.” www 


How many times have you wished that students, employees,
investors, or customers would snap, crackle, and pop
to exactly where they need to be to create a new future? 

It would be great if audiences were as compliant and 
unified in thought and purpose as these grains of salt. 
And they can be. If you adjust to the frequency of your 
audience so that the message resonates deeply, they, 
too, will display self-organizing behavior. Your listeners
will see the place where they are to move to create
something collectively beautiful. A groundswell. 

The audience does not need to tune themselves to 
you—you need to tune your message to them. Skilled 
presenting requires you to understand their hearts and 
minds and create a message to resonate with what’s 
already there. Your audience will be significantly moved 
if you send a message that is tuned to their needs and 
desires. They might even quiver with enthusiasm and 
act in concert to create beautiful results. 



Change Is Healthy


Presentations are about change. Businesses, and 
indeed all professions, have to change and adapt in 
order to stay alive. 

Organizations go through a life cycle of starting up, 
growing, maturing, and eventually declining—that is, 
unless they reinvent themselves. A business is usually
founded because someone came up with a clear
vision of the world in the future as an improved place. 
But that improved world quickly becomes an ordinary 
world. Once an organization arrives at maturity, it 
can't get too comfortable. To avoid potential decline, 
it must alter and adapt its strategy so it’s at the right 
place at the right time in the future. If an organization
doesn't take a new path, it will eventually wither.
Communicating each move carefully to all stakeholders
and clients becomes critical.

It takes gutsy intuitive skills to move toward an 
unknown future that involves unfamiliar risks and 
rewards, yet businesses must make these moves to 
survive. Companies that learn to thrive in the chronic 
flux and tension between what is and what could be 
are healthier than those that don’t. Many times the
future cannot be quantified with statistics, facts, or 
proofs. Sometimes leaders have to let their gut lead 
them into uncharted territories where statistics haven’t 
yet been generated. 

An organization should make continual shifts and 
improvements to stay healthy. That makes even simple 
presentations at staff meetings a platform for persuasion.
You need to persuade your team to self-organize
at a distinct place in the future; failing to do so could
bring about the demise of the organization.

Getting ahead of the next curve requires courage and 
communication: Courage to determine the next bold 
move, and communication to keep the troops committed
to the value of moving forward.

Rallying stakeholders to move together in a common 
course of action is all part of the innovation and survival 
process. Leaders at every level in an organization need 
to be skillful at creating resonance if that organization 
is to control its own destiny. 


[Figure: Business Transformation (axis label: SALES)]



“Progress is impossible without change; 
and those who cannot change their minds 
cannot change anything.” 

George Bernard Shaw 

Presentations Are Boring


Presentations are the currency of business activity 
because they are the most effective tool to transform 
an audience, yet many presentations are boring. Most 
are a dreadful failure of communication, and the rest are 
simply not interesting. Could there be a way to resuscitate
them to a point where they not only show signs of
life but actually engage audiences with rapt attention? 

If you’ve been trapped in a bad presentation, you 
recognize the feeling almost immediately. You can tell 
within minutes that it’s just not good; it doesn’t take 
long to recognize a corpse! To make matters worse, it’s 
becoming more and more difficult to keep an audience’s 
attention as global cultures become media-rich environments.
Slick ad agencies and Hollywood producers
spend enormous amounts of time and money to build a
pulse and rhythm into their media. While entertainment
has raised the bar for audience engagement, presentations
have become less engaging than ever.

So why then, if presentations are so bad, are they 
scheduled? People inherently know that connecting in 
person can yield powerful outcomes. We crave human 
connection. Throughout history, presenter-to-audience 
exchanges have rallied revolutions, spread innovation, 
and spawned movements. Presentations create a
catalyst for meaningful change by using human contact
in a way that no other medium can. Many times it isn’t
until you speak with people in person that you can establish
a visceral connection that motivates them to adopt
your idea. That connection is why average ideas sometimes
get traction and brilliant ideas die—it all comes
down to how the ideas are presented. 

Presentations with a pulse have an ebb and flow to them. 
Those bursts of movement result from contrast—contrast 
in content, emotion, and delivery. In the same way that 
your toe taps to a good beat, your brain enjoys tapping 
into ideas when something new is continually developing 
and unwrapping. Interesting insights and contrasts keep 
the audience leaning forward, waiting to hear how each 
new development resolves. 

It takes a lot of work to breathe life into an idea. Creating 
an interesting presentation requires a more thoughtful 
process than throwing together the blather that we’ve 
come to call a presentation today. Spending energy to 
understand the audience and carefully crafting a message 
that resonates with them means making a commitment of 
time and discipline to the process. 

There is a simple way to determine whether it’s worth 
putting this level of commitment into a presentation... 


Just ask yourself: How badly do I want my idea to live? 

The Bland Leading the Bland


The presenter’s job is to make the audience clearly
“see” ideas. If your ideas stand out, they’ll be noticed.

The enemy of persuasion is obscurity. 

You can learn what attracts attention by examining 
the opposite: camouflage. The purpose of camouflage 
is to reduce the odds that someone will notice you—by 
blending into an environment. When is blending in 
appropriate for a communicator? Never. The more you 
want your idea adopted, the more it must stand out. If 
the idea blends with the environment, both its clarity 
and chances for adoption are diminished. An audience 
should never be asked to make decisions based on 
unclear options. 

Don’t blend in; instead, clash with your environment. 
Stand out. Be uniquely different. That’s what will draw 
attention to your ideas. Nothing has intrinsic attention-grabbing
power in itself. The power lies in how much
something stands out from its context. If you go hunting
with your college buddies and don’t want to be
confused with their prey, you’d be advised to wear 
safety orange. Since there’s nothing in the woods that 
particular color, you’ll stand out. 


In communications, standing out from the “environment” 
means standing out among your competitors or even 
contrasting with your own organization. You must show 
how your idea contrasts with existing expectations, 
beliefs, feelings, or attitudes if you want to gain the audience’s
rapt attention. It certainly feels safer and easier
to conform to the well-worn groove of sameness than to 
stand out and be vulnerable. But being buried in a sea of 
sameness does not yield greatness or solve big problems. 

It can be scary running around your bland organization 
with a safety orange target on your back. It’s risky, and 
it takes fortitude to be different among friends and foes. 
But it’s important for your message to stand out, or it 
won’t be remembered. 

While you don’t necessarily need to rebel against the 
current messages and content, you do need to lift them 
out of the drab, traditional way they are communicated. 
Identify opportunities for contrast and then create fascination
and passion around these contrasts. Presentations
today are boring because there is nothing interesting happening.
They have no contrast, and hence interest is lost.

People Are Interesting


A great way to stand out is to be real. Presentations 
tend to be stripped of all humanness—despite the fact 
that humans make up the entire audience! Many corporations
condition employees to put meaningless words
together, project them on a slide, and talk about them 
like an automaton. The cultural norm is for presenters 
to hide behind slides as though that’s a form of skilled 
communication. Look at the slides to the right. These 
are real statements taken from real presentations. 
They’re meaningless. Yet these statements were written 
to attract and lure customers to products or services. 
It’s the wrong bait. 

Presenters think they can hide behind a wall of jargon, 
but what people are really looking for at a presentation 
is some kind of human connection. 

By far the most human, transparent, and relational form 
of communication takes place when two people share 
common beliefs and create a connection based on 
beliefs. A presentation is an ideal opportunity for connecting
because it’s one of the few forms of interaction
in which people are involved with one another in person. 


Deep connections are what make a great presentation 
stand out. Forming connections is an art, and when it’s 
practiced well, the results can be astounding. 

Being human and taking risks are the foundation of 
creative results. Taking risks shows you’re willing to tap 
into something your gut is telling you will work, without 
letting your head talk you out of it. That’s creativity 
and humanness at its best. Unfortunately, many cultures 
stifle risk-taking, and many workplaces constrain 
human connectedness. 

“Being true to yourself involves showing and sharing 
emotion. The spirit that motivates most great storytellers
is ‘I want you to feel what I feel,’ and the effective
narrative is designed to make this happen. That’s how 
the information is bound to the experience and 
rendered unforgettable.” 


Peter Guber 1 

It’s easier to rattle off jargon and keep communication 
emotionally neutral. But easiest doesn’t always mean best. 

These statements are from real presentations
that have had all the humanness sucked out 
of them. It’s easier to hide behind messages 
like these instead of tapping into what is 
human about us. 

At XYZ Co. we create new, innovative
businesses that would minimize
the return-on-investment period for
both strategic and financial investors,
while experiencing significant
revenue expansion.

XYZ Co. creates the ultimate global
alliance to monetize the Internet. 

We are the most reliable partner 
for global performance-based 
multichannel commerce, offering 
best-of-breed technology, services, 
and network to make more money 
with the Internet. 

XYZ Co. creates a center for rapid
prototyping of innovations that
encourage rapid failures to create
innovations of all kinds that
create both an inbound and outbound
gradient.


XYZ Co. is an international company 
of more than twenty talented professionals
dedicated to maximizing
sales opportunities and revenues
throughout Europe and North America
for quality media owners with
leading online and/or print brands. 


XYZ Co. improves quality of life by 
improving capability maturity. 


XYZ Co. is an online global resource 
center and membership community 
dedicated to helping small-business 
owners succeed and prosper. 


XYZ Co. delivers to our clients 
their design team’s intent and 
vision, at the lowest overall total 
delivered costs, with no sacrifice 
in quality, on time, and at, or more 
typically below budget. 

XYZ Co. enriches lives with superior
products at exceptional prices. 


XYZ Co. provides every athlete—from 
professional to recreational runners 
to kids on the playground—with the 
opportunity, products, and inspiration
to do great things. XYZ Co. helps
consumers, athletes, artists, partners, 
and employees reach heights they 
may have thought unreachable. 

Facts Alone Fall Short


You can have piles of facts and still fail to resonate. It's 
not the information itself that’s important but the emotional
impact of that information. This doesn't mean that
you should abandon facts entirely. Use plenty of facts, 
but accompany them with emotional appeal. 

There’s a difference between being convinced with logic 
and believing with personal conviction. Your audience 
may agree with the thought process you present, but 
they still might not respond to the call. People rarely 
act by reason alone. You need to tap into other deeply 
seated desires and beliefs in order to be persuasive. You 
need a small thorn that is sharper than fact to prick their 
hearts. That thorn is emotion. 

“The problem is this: No spreadsheet, no bibliography 
and no list of resources is sufficient proof to someone
who chooses not to believe. The skeptic will
always find a reason, even if it’s one the rest of us 
don’t think is a good one. Relying too much on proof 
distracts you from the real mission—which is 
emotional connection.” 


Seth Godin 2 

At some point in your life, you’ve had your emotions 
aroused. You’ve experienced a chill down your spine or 
a sick feeling in the pit of your stomach. When something
resonates emotionally, you feel it physically.

Currently, emotion is a powerful driver of consumer 
behavior, but it didn’t use to be. Before the 1900s,
people rarely expressed emotion publicly; it was not 
socially acceptable to discuss feelings or desires. 
Products developed were solely marketed as items of 
necessity, not items of desire. As PR and advertising 
became prevalent, companies began to compete 
based on consumer desire and not necessarily 
consumer need. Suddenly, irrelevant objects became
powerful symbols of status. 

Today, appealing to emotion is commonplace. Ads can 
make us laugh or cry, feel sexy or feel guilty. A full range 
of emotions can be felt during one thirty-minute television
show. Even restaurant menus tantalize us with food
that will make us feel decadent, surprised, or enraptured. 
We can’t escape it. 

So today more than ever, communicating only the 
detailed specifications or functional overviews of a 
product isn’t enough. If two products have the same 
features, the one that appeals to an emotional need 
will be chosen. 

Aristotle said that the man who is in command of persuasion
must be able “to understand the emotions—that is,
to name them and describe them, to know their causes
and the way in which they are excited,” and that “persuasion
may come through the hearers, when the speech
stirs their emotions.” 3

Consumers are accustomed to emotional appeal, and 
they are most certainly ready to respond emotionally to 
a presentation. So why don’t we present emotion? It’s 
uncomfortable. It’s an especially tough skill for analytical 
professionals to adopt. It’s easy to think, “I don’t get paid 
at work to feel, I get paid to do.” And that’s true. But if 
your team isn’t motivated to move forward or your customers
aren’t motivated to buy, then you are in trouble.

Including emotion in a presentation doesn’t mean it 
should be half fact and half emotion. It also doesn’t mean 
there should be boxes of tissue under each seat. It simply 
means that you introduce humanness that appeals to the 
desires of the audience. It’s not that difficult to evoke a 
visceral reaction in an audience if you use stories. 


“The public is composed of numerous groups 
whose cry to us writers is: ‘Comfort me.’ ‘Amuse 
me.’ ‘Touch my sympathies.’ ‘Make me sad.’ 
‘Make me dream.’ ‘Make me laugh.’ ‘Make me 
shiver.’ ‘Make me weep.’ ‘Make me think.’” 

Henri Rene Albert Guy de Maupassant 4 

Stories Convey Meaning


Ever since humans first sat around the campfire, stories
have been told to create emotional connections.

In many societies, they have been passed along nearly 
unchanged for generations. The greatest stories of 
all time were packaged and transferred so well that 
hundreds of illiterate generations could repeat them. 
Our early ancestors had stories to explain day-to-day 
occurrences in nature such as why the sun rises and 
falls, as well as more overarching metanarratives about 
the meaning of life. Stories are the most powerful delivery
tool for information, more powerful and enduring
than any other art form. 

People love stories because life is full of adventure 
and we’re hardwired to learn lessons from observing 
change in others. Life is messy, so we empathize with 
characters who have real-life challenges similar to the 
ones we face. When we listen to a story, the chemicals 
in our body change, and our mind becomes transfixed. 5 
We are riveted when a character encounters a situation 
that involves risks and elated when he averts danger 
and is rewarded. 

If you’re like many professionals, using stories to create 
emotional appeal feels unnatural because it requires 
showing at least some degree of vulnerability to people 
you don’t personally know all that well. Telling a personal
story can be especially daunting because great
personal stories have a conflict or complication that
exposes your humanness or flaws. But these are also
the stories that have the most inherent power to change
others. People enjoy following a leader who has survived
personal challenges and can share her narrative
of struggle and victory (or defeat) comfortably. 

“The best way to unite an idea with an emotion is by 
telling a compelling story. In a story, you not only 
weave a lot of information into the telling but you also 
arouse your listener’s emotions and energy. Persuading 
with a story is hard. Any intelligent person can sit down 
and make lists. It takes rationality but little creativity to 
design an argument using conventional rhetoric. But it 
demands vivid insight and storytelling skill to present 
an idea that packs enough emotional power to be 
memorable. If you can harness imagination and the 
principles of a well-told story, then you get people 
rising to their feet amid thunderous applause instead 
of yawning and ignoring you.” 


Robert McKee 6 

Information is static; stories are dynamic—they help an 
audience visualize what you do or what you believe. Tell 
a story and people will be more engaged and receptive 
to the ideas you are communicating. Stories link one 
person’s heart to another. Values, beliefs, and norms 
become intertwined. When this happens, your idea can 
more readily manifest as reality in their minds. 
You Are Not the Hero


When trying to connect with others during a presentation,
you have to remember that it’s not all about you.

Audiences detest arrogance and self-centeredness. 
They evoke the same feeling you get when you arrive at 
a party only to be cornered by a dreadful, self-centered 
know-it-all. He’ll talk about his own interests, how cool he 
is, and how great he is while you’re left thinking, “What an
ass,” and looking for any opportunity to get away. Why is
that? It’s because the conversation doesn’t include you, 
your ideas, or your perspective. Self-centered people 
don’t connect. No one wants to date, work with, or sit 
through a presentation given by someone like that. So 
why are presentations rife with self-centered content? 
Most presentations start with “me-ness.” Somewhere
in the front of the slide deck is the dreaded “it's all 
about me” slide that typically looks like one of the 
slides to the right. 

It is important that the audience know something about
you and your company, but there are other ways to
communicate this information (like a handout). That frees
you to focus on the people in the audience right from the
outset and to tune your presentation so it resonates at
their frequency instead of yours.

As a presenter, it’s easy to feel like your product or 
cause should be the most important thing on the minds 
of the audience. You may even think, “I’m their hero, here 
to save them from their helplessness and ignorance. If 
they only knew what I know, the world would be a better 
place.” If you show up and chatter about yourself, your 
products, and your synergies, you will become the self- 
centered know-it-all at the party, and the audience will 
want to flee. 

Instead, embrace a stance of humility and deference to 
your audience’s needs. Begin the presentation from a 
shared place of understanding. 


Make it about the audience. 


SELFISH APPROACH

About us
• Company history
• Market cap
• # employees and # locations

About our product and service
• What it is
• How it works
• Why it’s better than the alternative

Call to action (ideally)


SAMPLES FROM BAD PRESENTATIONS 


XYZ Co. Equity Partners, LLC
• Founded in 1988 in Anchorage, Alaska
• Invest in companies who:
  • Provide professional IT services
  • Offer exceptional technical and project management expertise
  • Deliver complex data and information management solutions
    as systems and/or applications integrators
• Average annual revenue: $51.5M

XYZ Co. Software
• Established in 1984
• Headquarters: San Francisco, CA
• Integrated P&C Insurance software and services
• Focused on Alternative Risk & Self-Insured markets
• Recognized leader in risk management solutions
• Over 100 customers in U.S. and Canada
The Audience Is the Hero


You need to defer to your audience because if they 
don't engage and believe in your message, you are the 
one who loses. Without their help, your idea will fail. 

You are not the hero who will save the audience; the 
audience is your hero. 

Screenwriter Chad Hodge points out in Harvard 
Business Review that we should “[help] people to see
themselves as the hero of the story, whether the plot
involves beating the bad guys or achieving some great
business objective. Everyone wants to be a star, or at
least to feel that the story is talking to or about him
personally.” 7 Business leaders need to take this to heart,
place the people in the audience at the center of the 
action, and make them feel that the presentation is 
addressing them personally. 

When you’re presenting, instead of showing up with an 
arrogant attitude that “it’s all about me,” your stance
should be a humble “it’s all about them.” Remember,
the success of you and your firm is dependent on them, 
not the other way around. You need them. 

So what’s your role then? You are the mentor. You’re 
Yoda, not Luke Skywalker. The audience is the one 
who’ll do all the heavy lifting to help you reach your 
objectives. You’re simply one voice helping them get 
unstuck in their journey. 

The mentor is often personified as a wise person such 
as The Oracle in The Matrix or even Mr. Miyagi in The 
Karate Kid. As mentor, your role is to give the hero 
guidance, confidence, insight, advice, training, or 
magical gifts so he can overcome his initial fears and
enter into the new journey with you. 

Changing your stance from thinking you’re the hero to 
acknowledging your role as mentor will alter your viewpoint.
You’ll come from a place of humility, the aide-de-camp
to your audience. A mentor has a selfless nature
and is willing to make personal sacrifices so that the 
hero can reach the reward. 

Most mentors were heroes themselves. They have 
become experienced enough to teach others about the 
special tools or powers they picked up on the journey of 
their own lives. Mentors have been down the road of the 
hero one or more times and have acquired skills that can 
be passed on to the hero. 

When you step up to give your presentation, you might 
be the most knowledgeable person in the room, but will 
you wield that knowledge with wisdom and humility? 
Presentations are not to be viewed as an opportunity to 
prove how brilliant you are. Instead, the audience should 
leave saying, “Wow, it was a real gift to spend time
in that presentation with (insert your name here). I’m 
armed with insights and tools to help me succeed that 
I didn’t have before.” 

Changing your stance from hero to mentor will clothe 
you in humility and help you see things from a new 
perspective. Audience insights and resonance can only 
occur when a presenter takes a stance of humility. 




Presentations have the power to change the world. The nexus of almost 
every movement and high-stakes decision relies on the spoken word to 
get traction, and presentations are a powerful platform to persuade. 

But presentations are broken; they are considered a necessary evil 
instead of a tool of great power. That power springs from the presenter’s 
ability to make a deep human connection with others. Instead of connecting 
with others, presentations tend to be self-centered, which alienates audiences.
The opportunity to transform is diminished when audiences don’t
feel a connection. 

Changing your stance from that of the hero to one of wise storyteller will 
connect the audience to your idea, and an audience connected to your 
idea will change. 
Incorporate Story


All types of writing, including presentations, fall somewhere in between two extreme poles: reports and stories. 
Reports inform, while stories entertain. The structural difference between a report and a story is that a report 
organizes facts by topic, while a story organizes scenes dramatically. 1 Presentations fall in the middle and contain 
both information and story, so they are called explanations. 


REPORT: Exhaustive
• Documentation: Informational and factual, emphasizing accuracy and
  exhaustive details, facts, and figures
• Structure: Topical, hierarchical
• Activities: Survey, collect, record, evaluate, notify, update
• Result: Findings, evidence, facts, details
• Delivery: Communicate in a plain, direct, and precise manner

PRESENTATION: Explanatory
• Oral Delivery: Persuasive and motivating, emphasizing explanation and
  making the meaning clear
• Structure: Dual, alternating between facts and storytelling
• Activities: Unfold, simplify, clarify, interpret, illuminate, elucidate
• Result: Motivation, activation, engagement
• Delivery: Communicate in a believable, credible, and engaging manner

STORY: Dramatic
• Cinema and Literature: Experiential and emotional, emphasizing
  evocative and implied information
• Structure: Dramatic (exposition, rising action, climax, denouement)
• Activities: Experience, express, emote, sense
• Result: Memories, links, associations
• Delivery: Communicate in an expressive and theatrical manner
It’s become the cultural norm to write presentations
as reports instead of stories. But presentations are not
reports. Many people who create presentations are stuck
in the mindset that if they use a presentation application,
like PowerPoint, to create a report, the report is a presentation.
It is not! Reports should be distributed; presentations
should be presented. Documents masquerade as
presentations, and these “slideuments” 2 have become
the lingua franca of many organizations. While documents
and reports are very valuable, they do not need to be
projected for the purpose of hosting a “read-along.”

If a report primarily conveys information, a story
produces an experience. Blending the two creates a perfect
world for your presentation where facts and stories can be 
layered like a cake. Navigating between fact, then story, 
then fact, then story creates interest and a pulse. Mixing 
report material with story material makes information 
more digestible. It's the sugar that helps the medicine 
go down. 


It's more comfortable and less time consuming to present 
flat, data-driven static reports, but that approach doesn't 
connect people to ideas. The moment you know you 
need to create a presentation and not a report, shift your 
mindset from solely transferring information to creating 
an experience. This is the first step in moving along the 
spectrum away from a pure report toward a story. 

There are plenty of opportunities to use dramatic story 
structure in presentations. But how do you create a 
dramatic experience? Creating desire in the audience 
and then showing how your ideas fill that desire moves 
people to adopt your perspective. This is the heart of 
a story. 

This chapter will draw insights from the best story 
methods available today: mythology, literature, and 
cinema. Once you understand their power, you’ll see 
why great presentations move away from reports and 
closer to stories. 
Drama Is Everything


Presentations have the potential to hold an audience's 
interest just like a good movie. You might be thinking 
that it takes years to write a successful screenplay, and 
you have a real job to do. But isn’t part of your “real job” 
to communicate ideas well, help people understand 
objectives, and persuade them to change? Building your 
presentations with some of the attributes from myths 
and movies will help your ideas resonate with others. 

Great stories introduce you to a hero to whom you can 
relate. The hero is usually a likeable sort who has an 
acute desire or goal that is threatened in some way. 

As the story unfolds and trials are met with triumph, 
you cheer for the hero until the story is resolved and 
the hero is transformed. As author Robert McKee 
explains, “Something must be at stake that convinces 
the audience that a great deal will be lost if the hero 
doesn't obtain his goal.” 3 If nothing is at risk, then it's 
not interesting. 
Your communications follow a similar pattern. You have
a goal that needs to be reached, but there will be trials 
and resistance. However, when your desire is realized, 
the outcome will yield remarkable results. 

One of the reasons presentations are dull is that they
have no identifiable story patterns. In the next few pages,
you’ll review story models actively used in Hollywood 
that are fundamental to a good screenplay. These forms 
work! They are not formulas or rigid sets of rules—they 
address structure and character transformation, yet also 
leave room for flexibility and creativity. After you review 
the Hollywood story forms, you’ll be introduced to the 
presentation form. It’s a similar form, but one that’s 
tailored to presentations. Applying these methods will 
help craft your message and unlock the story potential 
in your presentations. 


Story Pattern 

The simplest way to describe the structure of a story is situation, complication, and resolution. From mythic adventures
to recollections shared around the dinner table, all stories follow this pattern. 


RELATABLE AND LIKABLE HERO 

ENCOUNTERS ROADBLOCKS 

EMERGES TRANSFORMED 

Snow White 

Situation: Snow White takes refuge 
in the forest with seven dwarfs 
to hide from her stepmother, the 
wicked queen. 

Complication: Snow White is more
beautiful than her stepmother, so the
queen, disguised as a peddler, poisons
Snow White with an apple.

Resolution: The prince, who has 
fallen in love with Snow White, 
awakens her from the spell with 
“love’s first kiss.” 

E.T. 

Situation: A group of alien botanists 
visit earth. After a hasty takeoff, one 
of them is left behind. And he wants 
to get back home. 

Complication: Ten-year-old Elliott 
forms an emotional bond with E.T., 
a task force tries to hunt down E.T., 
and he and Elliott get very sick. 

Resolution: E.T. and Elliott build a 
communication device and escape 
on a bicycle. E.T. is rescued and tells 
Elliott he’ll be in his heart. 

Avatar 

Situation: Jake Sully is a paralyzed
ex-Marine who is selected for the
Avatar program, which will enable
him to walk through a proxy Na’vi
body in the land of Pandora.

Complication: Jake falls in love with
a Na’vi woman, Neytiri, in Pandora.
As the humans encroach on the
forest seeking valuable minerals,
Jake is forced to choose sides in
an epic battle.

Resolution: Under Jake’s leadership,
the Na’vi defeat the humans. Jake
is permanently transformed into a
Na’vi and gets to live on Pandora
with Neytiri.
Story Templates Create Structure


Screenwriters use tools to create strong story structures.
Syd Field is considered the father of Hollywood’s story 
template. In his book, Screenplay, Field uses concepts 
from the three-act structure first proposed by Aristotle 
to create the Syd Field Paradigm, shown on the right. 
Field noticed that in successful movies, the second act 
was often twice the length of the first and third acts: 

• Act 1 sets up the story by introducing characters, 
creating relationships, and establishing the hero’s 
unfulfilled desire, which holds the plot in place. 

• Act 2 presents dramatic action held together by 
confrontation. The main character encounters 
obstacles that keep him or her from achieving his 
or her desire (dramatic need). 

• Act 3 resolves the story. Resolution doesn’t mean 
ending but rather solution. Did the main character 
succeed or fail? 4 

All stories have a beginning, a middle, and an end. There’s
a defining point at which the beginning turns into the
middle and the middle into the end. Field, a leading
screenwriting teacher, calls these plot points. A plot
point is defined as any incident, episode, or event that
spins the story around in another direction. Each plot
point sets up the story for a change.

A great presentation is similar to a screenplay in 
several ways: 

• It has a clear beginning, middle, and end. 

• It has an identifiable, inherent structure. 

• The first plot point is an incident that captures 
the audience’s intrigue and interest. In presentations,
we’ll call this a turning point.

• The beginning and end are much shorter than 
the middle. 

This is a form, not a formula. It’s what a screenplay 
would look like if you could X-ray it and examine its 
structure. The movie The Shawshank Redemption* is shown
to the right with the acts and plot points annotated. 

Field’s model makes sense as a template for scripting 
movies; however, it is only partially applicable to 
presentations. Next, we’ll examine an additional story 
form that will supply some of the missing pieces. 
Syd Field’s Paradigm

[Figure: the three-act paradigm, annotated with the Shawshank Redemption story*]

• ACT 1 (Set-Up): Andy is convicted and enters Shawshank
• PLOT POINT 1: Andy asks Red for rock hammer
• ACT 2A (Confrontation, First Half): Andy forms relationship with Red and adapts to prison life
• MIDPOINT: Andy plays opera aria over prison loudspeaker
• ACT 2B (Confrontation, Second Half): Andy passes his knowledge on to inmates
• PLOT POINT 2: Andy escapes from prison
• ACT 3 (Resolution): Andy and Red reunite in Mexico

*SHAWSHANK REDEMPTION STORY


Andy, a young banker convicted of murdering his wife and her lover, is sentenced to Shawshank
Penitentiary. In prison, Andy meets and forms a relationship with Red, another convicted killer, and
then becomes an ally and trusted friend of the warden. When his attempts to win a retrial fail, he escapes
from Shawshank. At the end, Andy makes his way to Mexico, where he and Red are reunited.
The Hero’s Journey Structure


Another story model to consider is The Hero’s Journey, 
drawn from the psychology of Carl Jung and the mythological
studies of Joseph Campbell.

The wheel to the right is an overview of The Hero’s 
Journey that has been slightly simplified by Christopher 
Vogler, author of The Writer's Journey. Vogler spent 
years as a story analyst for screenplays in Hollywood 
and uses this as a form for his analyses. Starting at the 
top of the wheel, move clockwise through the steps. 
The gray text of the innermost circle walks you through 
the stages of The Hero's Journey: (1) Heroes are introduced
in the Ordinary World, where (2) they receive
the Call to Adventure. (3) They are initially reluctant 
and might even Refuse the Call but (4) are encouraged 
by a Mentor to (5) Cross the First Threshold and enter 
the Special World, where (6) they encounter Tests, 
Allies, and Enemies. (7) They Approach the Inmost 
Cave, where (8) they endure the Ordeal. (9) They take 
possession of their Reward and (10) are pursued on the 
Road Back to the Ordinary World. (11) They experience 
a Resurrection and are transformed by the experience. 
(12) They Return with the Elixir—a boon or treasure to 
benefit the Ordinary World. 6 

Heroes endure physical activities (outer journey) but also 
experience internal transformations to their hearts and 
minds at each stage. This inner journey is represented by 
green text in the second ring. Then, the outermost ring 
uses Star Wars: Episode IV as an example, showing the 
outer journey in gray text and the inner journey in green. 

An important insight emerges when The Hero’s 
Journey is represented in a circle: It creates a clear 
division between the ordinary world and the special
world (signified by the gray dotted line). There is a
moment in every story where the character overcomes 
reluctance to change, leaves the ordinary world, and 
crosses the threshold into an adventure in a special 
world. In the special world, the hero gains skills and 
insights—and then brings them back to the ordinary 
world as the story resolves. 

A good presentation is a satisfying, complete experience.
You might cry, laugh, or do both, but you’ll also
feel you’ve learned something about yourself. 

Presentations use insights from myths and movies in 
several ways: 

• There’s a likable yet flawed hero attending your 
presentation. 

• A presentation should take the audience on a 
journey from their ordinary world into your special 
world, gaining new insights and skills from your 
special world. 

• The audience makes a conscious decision to cross 
the threshold into your world; they are not forced. 

• The audience will resist adopting your point of view 
and will point out obstacles and roadblocks. 

• The audience needs to change on the inside before 
they’ll change on the outside. In other words, they 
need to alter their perception internally before they 
change the way they act. 

Crossing the threshold is an important moment because 
it signals that the hero is making a commitment. Let’s 
look more closely at that turning point. 
The Hero’s Journey

[Figure: The Hero’s Journey wheel, divided into Act 1, Act 2A, Act 2B, and Act 3, with Star Wars: Episode IV as the example. Inner-journey labels are given where they are legible.]

(1) Ordinary World. Inner journey: Limited awareness of a problem.
Star Wars: The Evil Empire oppresses the galaxy. Luke dreams of joining the academy but feels he is going nowhere on his uncle’s desolate farm.

(2) Call to Adventure.
Star Wars: R2D2 plays a portion of Princess Leia’s call for help. Luke is smitten by the vision and wants to help the maiden in distress.

(3) Refusal of the Call. Inner journey: … to change.
Star Wars: Luke refuses to follow Obi-Wan because he feels obligated to stay and help his aunt and uncle on the farm.

(4) Meeting with the Mentor. Inner journey: Overcoming reluctance.
Star Wars: R2D2 plays the entire message, revealing that Luke holds the plans of the Death Star. Obi-Wan gives Luke his father’s lightsaber and tells him of his heritage. Luke wants to help.

(5) Crossing the Threshold. Inner journey: Committing to change.
Star Wars: Luke’s aunt and uncle are killed, so he is free to deliver the secret plans to Alderaan. He and Obi-Wan travel to Mos Eisley to hire a ship for their journey.

(6) Tests, Allies, and Enemies.
Star Wars: In the cantina, Luke is saved by Obi-Wan’s use of the Force. The two hire Han Solo and Chewbacca, who become their allies. They evade Imperial Stormtroopers who try to prevent their escape.

(7) Approach the Inmost Cave.
Star Wars: On the Millennium Falcon, Obi-Wan teaches Luke about the Force. The ship is captured by the Death Star, and the group finds itself inside the enemy’s stronghold.

(8) Ordeal.
Star Wars: On the Death Star, they dress as Stormtroopers, discover the princess, and attempt to rescue her. They are discovered and tested as they engage with enemy troops.

(9) Reward (seizing the sword). Inner journey: Consequences of the attempt (improvements and setbacks).
Star Wars: In the trash compactor, Luke is pulled underwater by a creature but is rescued by his friends. They begin to work together as a team to escape the Death Star.

(10) Road Back.
Star Wars: Obi-Wan sacrifices himself to help the team escape. The Death Star follows them to the Rebels, determined to destroy their base. Luke joins the Rebels’ attack on the Death Star.

(11) Resurrection.
Star Wars: In the final battle, Luke hears Obi-Wan’s voice and uses the Force to make an impossible shot that destroys the Death Star.

(12) Return with the Elixir.
Star Wars: The power of the Evil Empire is destroyed. The team members are honored as heroes and peace is restored to the galaxy.

gray text = outer journey
green text = inner journey (character transformation)

Factoid: When George Lucas came across Joseph Campbell’s work, he
modified Star Wars, Episode IV to map more closely to this model.
Crossing the Threshold


If the audience is the hero in your story, then the objective
during your presentation is to get them past the
fourth step in the wheel. Your presentation takes them 
to the threshold, but it’s their choice whether to cross 
it or not. 

Your presentation proposes an idea, and you’re 
asking the audience to adopt and shepherd that idea 
to positive outcomes. Your idea might be to reshape 
an organization for the future or to show customers 
how your product will fill a need they have. It might 
even be to have students test well and internalize 
the subject matter. Whatever it is, the decisions the 
audience might make require them to consciously 
step into something new. 

The change you’re requesting will not come without 
a struggle for your heroes, and you need to acknowledge
that. Change is hard. Getting people to commit
to change is probably an organization’s greatest 
challenge. Notice how the hero meets the mentor just 
when he or she needs to decide whether to cross 
the threshold—and enter the special world. It’s a lovely 
parallel to presenting. As their mentor, your insights 
will help the audience make a decision to change. But 
you can’t force them. If you present well, they’ll cross 
the threshold voluntarily and jump in. 
If the audience has decided to cross the threshold and
adopt your perspective, they begin the rest of The
Hero's Journey (stages five through twelve) when they
leave your presentation. As their mentor, your presentation
should prepare them as much as possible for what
they can expect on the rest of the journey and set them
up to be successful along the way. Usually, the stages
of The Hero’s Journey in movies take place in sequential,
chronological order. But when developing a presentation,
you aren’t bound to keep to the constraints of a
place and time. The presentation medium allows you to 
bounce around out of sequence as you address insights 
into how steps five to twelve will be accomplished. 

Let’s remember that there is one indisputable attribute 
of a good story: there must be some kind of conflict or 
imbalance perceived by the audience that your presentation
resolves. This sense of discord is what persuades
them to care enough to jump in. In a presentation, you 
create imbalance by consciously juxtaposing what is 
with what could be. 

Clearly contrast who the audience is when they walk 
into the room (in their ordinary world) with who
they could be when they leave the room (crossing the 
threshold into a special world). What is versus what 
could be. Drawing attention to that gap forces the 
audience to contend with the imbalance until a new 
balance is achieved. 


The Audience’s Journey

[Figure: The Hero’s Journey wheel with the audience’s journey mapped onto each stage]

(1) Ordinary World: A likable audience is unaware that they have a problem or opportunity.
(2) Call to Adventure: They are shown a unique idea that brings their world into an imbalance.
(3) Refusal of the Call: They are skeptical, afraid, and resistant to adopt it because it will require change, and change is hard.
(4) Meeting with the Mentor: But a presenter with experience, valuable insights, and magical tools will help on the journey.
(5) Crossing the Threshold: So they decide to jump in and commit to the idea.
(6) Tests, Allies, and Enemies: Now the real work begins, but it’s hard. People and things oppose the effort to change.
(7) Approach the Inmost Cave: They are determined to push the idea forward and begin to work on new skills to be successful.
(8) Ordeal: They take a major step toward your idea, and it doesn’t quite work out as they’d thought.
(9) Reward (seizing the sword): They get discouraged and consider giving up on the idea, but they begin to see some benefit from their efforts.
(10) Road Back: They decide to continue on with a renewed excitement, even though the resistance around them is chronic.
(11) Resurrection: Utilizing their new tools, they try one final time to push the idea forward and are victorious.
(12) Return with the Elixir: The idea is widely adopted and the galaxy is a better place.

Their commitment will be tested, and they’ll need to renew
their loyalty to the idea over and over before it’s reality.

Your goal is to get them to commit to crossing the
threshold and adopting your perspective. Once the
audience commits to jumping in, the real adventure begins.

One of the things that makes an audience resistant is they
can see how tough stages six through eleven are going to be.
It’s your job to acknowledge that you know how tough
the journey will be.

The audience will stay skeptical and won’t cross
the threshold into your special perspective unless
you have wisdom to guide them and a useful
idea or tool to give them.

gray text = The Hero’s Journey
blue text = the audience’s journey
The Contour of Communication

The Presentation Form 


A presentation form emerged from insights drawn from mythological,
literary, and cinematic structures. Most great presentations
follow this form, often unknowingly.

Presentations should have a clear beginning, middle, and end. 
Two clear turning points in a presentation’s structure guide 
the audience through the content and distinctively separate 
the beginning from the middle and the middle from the end. 
The first is the call to adventure, which should show the audience
a gap between what is and what could be, jolting them
from complacency. When it is effectively constructed, an imbalance
is created, and the audience will want your presentation to
resolve this imbalance. The second turning point is the call to
action, which identifies what the audience needs to do or how 
they need to change. This second turning point signifies that 
you’re coming to the presentation’s conclusion. 

Notice how the middle moves up and down as if something new 
is happening continually. This back and forth structural motion 
pushes and pulls the audience to feel as if events are constantly 
unfolding. An audience will stay engaged as you unwrap ideas 
and perspectives frequently. 

Each presentation concludes with a vivid description of the new 
bliss that’s created when your audience adopts your proposed 
idea. But notice that the presentation form doesn’t stop at the 
end of the presentation. Presentations are meant to persuade, 
so there is also a subsequent action (or crossing the threshold)
the audience is to take once they leave the presentation.

Let’s look at the form in more detail on the following pages. 
[Figure: The Presentation Form. The line starts flat at what is, rises across the gap to what could be at the first turning point, alternates between what is and what could be through the middle, and ends at the reward: new bliss.]

BEGINNING
Paint a picture of the realities of the audience’s current world.

Turning Point 1: CALL TO ADVENTURE
Create an imbalance by stating what could be juxtaposed to what is.

MIDDLE
Present contrasting content, alternating between what is and what could be.

Turning Point 2: CALL TO ACTION
Articulate the finish line the audience is to cross.

END
End the presentation on a higher plane than it began, with everyone
understanding the reward in the future.

CROSS THE THRESHOLD
The audience leaves the presentation committed to taking action,
knowing it won’t be easy but will be worth the reward.
The Beginning and Call to Adventure 


The Hero’s Journey begins when “a hero ventures 
forth from the world of common day into a region of 
supernatural wonder.” 7 Your presentation may not offer 
“supernatural wonder,” but you are asking the audience 
to leave their comfort zone and venture to a new place 
that is closer to where you think they should be. 

The beginning of the presentation form is everything 
that comes before the first turning point, the call to 
adventure. The first flat line of the form represents 
the beginning of your presentation. This is where you 
describe the audience’s ordinary world and set the 
baseline of what is. You can use historical information 
about what has been or the current state of what is, 
which often includes the problem you’re currently facing. 

You should deliver a concise formulation of what everyone 
agrees is true. Accurately capturing the current reality and 
sentiments of the audience’s world demonstrates that you 
have experience and insights on their situation and that 
you understand their perspective, context, and values. 

Done effectively, this description of where your audience 
currently is will create a common bond between 
you and them and will open them up to hear your 
unique perspective more readily. Audiences are grateful 
when their contribution, intelligence, and experience 
are acknowledged. 

Additionally, describing their existing world gives 
you the opportunity to create a dramatic dichotomy 
between what is and what could be. Proposing what 
could be should throw the audience’s current reality 
out of balance. Without first setting up what is, the 
dramatic effect of your new idea will be lost. 

The beginning doesn’t have to be long. It might be as 
simple as a short statement or phrase that sets the 
baseline of what is. While it can be longer, it should not 
take up more than 10 percent of your total time. The 
audience will be anxious to know why they came and 
what you are proposing. So, although the beginning 
is important, it shouldn’t be long-winded. 

The first turning point to occur in a presentation is the 
call to adventure, which triggers a significant shift in 
the content. The call to adventure asks the audience 
to jump into a situation that, unbeknownst to them, 
requires their attention and action. This moment sets 
the presentation in motion. 

“A bad beginning makes a bad ending.” 


Euripides 8 
To create the call to adventure, put forth a memorable 
big idea that conveys what could be. This is the moment 
when the audience will see the stark contrast between 
what is and what could be for the first time—and it's 
crucial that the gap is clear. 


[Figure: The gap between what is and what could be. Dramatic tension is created by contrasting the commonplace with the lofty.] 


The call to adventure in a presentation plays a role 
similar to the inciting incident in a movie. Story author 
Robert McKee says, “The inciting incident first throws 
the protagonist's life out of balance, then arouses in him 
the desire to restore that balance.” 9 That imbalance is 
what elicits the audience's desire for a reality different 
from the current one. Pose an intriguing insight that 
your audience will want the presentation to address. It 
should stir them up enough (positively or negatively) so 
that they want to listen intently as you explain what is at 
stake and what it takes to resolve the gap. 

This turning point should be explicit, not muddled or 
vague. The remainder of the presentation should be 
about filling that gap and drawing the audience toward 
your unique perspective of what could be. 


“Man is the only animal that laughs and weeps; for he 
is the only animal that is struck with the difference 
between what things are and what they ought to be.” 

William Hazlitt 10 


BELOW IS AN EXAMPLE OF A CALL TO ADVENTURE FOR A PRODUCT LAUNCH 


What is: Analysts have been placing our products in the 
top spot in three out of five categories. Our competitor 
just shook up the industry with the launch of their T3xR. 
It has been heralded as the most innovative product in 
our space for the last four years. The predictions are 
that firms like ours will have no future unless we license 
the T3xR from our competitor. 


What could be: But we will not concede! In fact, today 
we will retain our lead! I'm pleased to tell you that five 
years ago we had the same product idea as the T3xR. 
But after rapid prototyping we discovered a way to 
leapfrog that generation of technology. So today, 
we’re launching a product so revolutionary that we'll 
gain a ten-year lead over our competitors. Ladies and 
gentlemen, introducing the e-Widget. Isn’t it beautiful? 
The Middle: Contrast 


The middle of a presentation is made up of various 
types of contrast. People are naturally drawn to 
contrast because life is surrounded by it. Day and 
night. Male and female. Up and down. Good and evil. 
Love and hate. 

Your job as a communicator is to create and resolve 
tension through contrast. 

Building highly contrasting elements into a presentation 
holds the audience’s attention. Audiences enjoy 
experiencing a dilemma and its resolution—even if that 
dilemma is caused by a viewpoint that’s opposed to 
their own. It keeps them interested. 

The audience wants to know if your views are similar 
to or different from their views. While listening to a 
presenter, audience members catalog and classify what 
they hear. Having come into the room with their own 
knowledge and biases, they are constantly evaluating 
whether what you say fits within their life experiences 
or falls outside of what they know. 

It’s important to know your audience so that you can 
understand how your views are both similar to and different 
from theirs. There will usually be some disparities. 
A rather obvious business example would be that you 
want them to buy your product, and they don’t want to 
spend the money. 

But differences aren’t a problem. The polarity between 
similar and dissimilar concepts creates a force that can 
be put to good use. In fact, both extremes are necessary 
in a presentation. They allow you to create observable 
distinctions between your perspectives and your 
audience’s perspectives, which helps keep their attention. 
Though people are generally more comfortable with 
what’s familiar to them, conveying the opposite creates 
internal tension. Oppositional content is stimulating; 
familiar content is comforting. Together, these two 
types of content produce forward movement. 

There are three distinct types of contrast you can build 
into a presentation: 

• Content: Content contrast moves back and forth to 
compare what is to what could be—and your views 
versus the audience’s (pages 104 to 105). 

• Emotion: Emotional contrast moves back and forth 
between analytical and emotional content (pages 
136 to 137). 

• Delivery: Delivery contrast moves back and forth 
between traditional and nontraditional delivery 
methods (pages 138 to 139). 

Contrast is a motif woven throughout this entire book 
and is at the heart of communication, because people 
are attracted to things that stand out. 

“As the polarized nature of magnetic fields can be 
used to generate electrical energy, polarity in a story 
seems to be an engine that generates tension and 
movement in the characters and a stirring of emotions 
in the audience.” 


Chris Vogler 
Call to Action 


The second transition, the call to action, clearly defines 
what you're asking the audience to do. Successful 
persuasion leads to action, and it is important to clearly 
state exactly how you want the audience to take action. 
This step in the presentation gives the audience discrete 
tasks that will help bring the ideas you convey in your 
presentation to fruition. Once this line is crossed, the 
audience needs to decide if they are with you or not— 
so make it clear what needs to be accomplished. 

Whether a presentation is political, corporate, or 
academic, the audience consists of four distinct types 
of people capable of taking action: doers, suppliers, 
influencers, and innovators. 

Because of differences in temperaments, every audience 
member will have a natural preference for one 
type over another. Providing each type with at least 
one action that’s suited to their temperament allows 
them to choose the action they’re most comfortable 
performing. When audience members see how they 
can help, it leads to momentum and quicker results. 
Most people are equipped with the ability to carry 
out at least one of the four types of actions effectively. 

A truly passionate revolutionary for your ideas could 
embody all four of the action types. 
Sample calls to action that can be requested of 
an audience: 

• The doer can be asked to assemble, decide, gather, 
respond, or try. 

• The supplier can be asked to acquire, fund, provide 
resources, or provide support. 

• The influencer can be asked to activate, adopt, 
empower, or promote. 

• The innovator can be asked to create, discover, 
invent, or pioneer. 

Be sure to identify actions that are simple, straightforward, 
and easily executed. The audience should be 
able to mentally connect their actions with a positive 
outcome for themselves, or for the greater good. 
Present all the necessary actions and make sure the 
most critical tasks for success are emphasized. 

Many presentations end with the call to action; however, 
ending a presentation with a to-do list for the audience is 
not inspirational. So it’s important to follow up the call to 
action with a vivid picture of the potential reward. 


WHO THEY ARE: Doers 
WHAT THEY DO FOR YOU: Instigate activities 
HOW THEY DO IT: These audience members are your worker bees. Once they know what has to be done, they’ll do the physical tasks. They recruit and motivate other doers to complete important activities. 

WHO THEY ARE: Suppliers 
WHAT THEY DO FOR YOU: Get resources 
HOW THEY DO IT: These audience members are the ones with the resources, whether financial, human, or material. They have the means to get you what you need to move forward. 

WHO THEY ARE: Influencers 
WHAT THEY DO FOR YOU: Change perceptions 
HOW THEY DO IT: These audience members can sway individuals and groups, large and small, mobilizing them to adopt and evangelize your idea. 

WHO THEY ARE: Innovators 
WHAT THEY DO FOR YOU: Generate ideas 
HOW THEY DO IT: These audience members think outside the box for new ways to modify and spread your idea. They create strategies, perspectives, and products. They bring their brains to the table. 
The End 


Notice that the end of the presentation is on a higher 
plane in the presentation form than the beginning. The 
ending should leave the audience with a heightened 
sense of what could be and a willingness to be transformed: 
to be able to either understand something 
new or do something differently. Audience transformation 
is the goal of persuasion. Skillfully defining the 
future reward compels the audience to get on board 
with your idea. 

The ending should repeat the most important points 
and deliver inspirational remarks encompassing what 
the world will look like when your idea is adopted. 

The principle of recency states that audiences remember 
the last content they heard in a presentation more 
vividly than the points made in the beginning or middle. 
So you should create an ending that describes an 
inspirational, blissful world—a world that has adopted 
your idea. What will the audience members’ lives 
look like? What will humanity look like? What will the 
planet look like? 

In order to get the most out of the audience, describe 
the possible future outcomes with wonder and awe. 
Show the audience that the reward will be worth their 
efforts. The presentation should conclude with the 
assertion that your idea is not only possible but that 
it is the right—and better—choice to make. 
“Getting the audience to cheer, rise, and vocalize in 
response to a dramatic, rousing conclusion creates 
positive emotional contagion, produces a strong 
emotional takeaway, and fuels the call to action by 
the business leader. The ending of a great narrative 
is the first thing the audience remembers.” 


Peter Guber 12 

Let’s say you pulled off an incredible presentation. You 
used the principles in the presentation form with grace 
and ease to convey your ideas, and the audience made 
a commitment to transform. Sounds like a huge victory— 
but it’s not over yet. The end of your presentation marks 
the next phase of the adventure for the audience. 

The human ability to accept new insights creates room 
for people to become something different. As indicated 
by the final dashed line at the end of the presentation 
form, the audience starts becoming something different 
from what they were at the beginning of the presentation. 

But when you are done delivering your presentation, the 
adoption of your idea is still inconclusive. The audience 
will determine the outcome. Great presentations end with 
the audience leaving full of support; bad ones don’t. The 
outcome could end as a comedy or as a tragedy. If they 
don’t adopt your idea, it could end as a tragedy in which 
your once admirable hero makes a personal error by not 
moving forward with your call to action. Or if they act on 
your call to action, it resolves as a comedy, which doesn’t 
necessarily mean “funny”; it means there was a rise in the 
fortune of a hero who deserved to succeed. 


“What we call the beginning is often the end. And 
to make an end is to make a beginning. The end 
is where we start from.” 


T. S. Eliot 13 


What Is a Sparkline? 


Throughout this book, the presentation form will be used 
to analyze presentations graphically as a sparkline. This 
will help you see the contrast in a presentation by visualizing 
its contour. The line moves up and down between 
what is and what could be, but it also changes colors to 
signify contrasts in emotion and delivery. Each presentation 
has its own unique pattern. No two sparklines are 
alike, because no two presentations are alike. 

Using a tool like the presentation form to achieve great 
results isn’t new. Movies and myths all have a form, 
and they yield beautiful and unique results. Similarly, 
presentations that follow the presentation form will all be 
unique. The presentation form isn’t a formula, because 
it has enormous flexibility; rigid adherence to it could 
make your presentations too predictable. So it’s equally 
important to embrace its versatility. 

Below is an annotation of how to read the sparklines 
in the book. The case study on the following pages will 
show the first use of the presentation form applied as 
a sparkline. Videos of all the presentations analyzed are 
available online along with additional annotations to the 
transcripts. 


[Figure: How to read a sparkline. The line moves between what is and what could be across the BEGINNING, MIDDLE, and END. Annotations mark Establish Imbalance, Turning Point 1: Call to Adventure, the S.T.A.R. Moment, and Turning Point 2: Call to Action. The gray line shows spoken words; colored lines show contrasting emotion and delivery. Beneath the line, time codes run from 0:00 to 0:45, with rows of tick marks for engagement level (laughter and clapping) and verbal cues.] 

A few sparklines have a second layer of tick marks to show structural or verbal 
insights. Look for a second layer of tick marks on Feynman, Jobs, and Ortberg. 


Lessons from Myths and Movies 47 




















Case Study: Benjamin Zander 

TED Talk 


Benjamin Zander has a contagious passion for classical 
music. Motivational speaker and conductor of the Boston 
Philharmonic Orchestra, he’s intent on persuading 
everyone to fall in love with classical music. And during 
his 2008 TED talk, the audience was visibly moved 
toward that end. 

If you haven’t yet seen this presentation, please watch 
it! Go to TED.com and search for Benjamin Zander to 
see this master communicator in action. Less than 
a minute into the presentation, the audience is already 
responding to its content. They laugh early and often. 
He energetically engrosses the audience several ways: 

• Structural Contrast: Zander gracefully shifts between 
what is and what could be by establishing a clear 
gap between those in the audience who already 
passionately love classical music and those who 
feel it’s simply like second-hand smoke at the airport. 
He’s determined not to leave the room until 
everyone is in love with classical music. 

• Delivery Contrast: He contrasts his delivery several 
ways. He alternates between speaking and playing 
the piano. He physically involves the audience by 
having them sing. He moves from the stage into the 
audience several times, even touching the faces of 
the audience members! He also uses large gestures 
and dramatic facial expressions. 


• Emotional Contrast: Zander tells several stories; 
some evoke laughter and some, tears. Though they 
alternate between funny and touching, each one 
connects the hearts of the listeners to the material 
and moves them (emotionally and behaviorally) 
toward loving classical music. 

Like all great mentors, Zander gives the audience 
members a special tool: he teaches them how to listen 
to the music. They learn to identify impulses and chord 
progressions. He trains their ears in music theory. Many 
in the audience haven’t loved classical music because 
they were unable to hear the layers of beauty within it. 
Zander unfolds these layers for them. 

Zander brilliantly uses the music as the message as he 
elicits and connects with listeners’ emotions. Having 
trained their ears to recognize the sense of longing 
created by an unresolved chord, he then goes straight 
for the heart. He asks them to remember a loved one 
who is no longer with them as he replays a piece by 
Chopin. This is the S.T.A.R. moment (page 148) in the 
presentation. Possibly for the first time in their lives, 
the audience can hear the longing in the music, and 
they are deeply moved. 

Zander demonstrates all the components of a perfect 
presentation form, which is annotated on page 50. 


Zander’s Sparkline 


Establish What Could Be 

Zander is passionate about showing the 
audience how to love classical music. He says, 
“It doesn’t work for me to [have] a wide gulf 
between those who understand, love, and 
are passionate about classical music, and 
those who have no relationship to it at all...I’m 
not going to go until every single person in 
this room...come[s] to love and understand 
classical music.” 
Establish What Is 

After hooking the 
audience with a story, 
Zander states, “There 
are some people who 
think that classical 
music is dying.” 


Teach Them to Listen 

Zander teaches the audience how 
to listen for “impulses” in the music 
and challenges the audience to 
listen for them in his playing. He 
educates them about music theory 
and performance. 


Engage by Singing 

When he describes the Chopin prelude, 
he plays descending notes of a scale (B, 
A, G, F#), then withholds the last note 
(E) and invites the audience to sing it. 
They’re reluctant at first, so he repeats 
his request. When the audience sings 
the final note, he remarks, “Oh, the TED 
choir!” eliciting laughter. 
Emotional Contrast 

Zander taught the audience how a chord pulls the music toward the 
home key like a magnet. As the music moves away from home into 
other chords, the music feels persistently unresolved. As the music 
persists in long, unresolved chords, it creates a sense of longing until 
it finally comes back to the home key. The music wants to resolve 
and go home. Then he says, “Would you think of somebody who 
you adore who is no longer there—a beloved grandmother, a lover, 
somebody in your life who you love with all your heart but that 
person is no longer with you. Bring that person into your mind and 
at the same time follow the line all the way from B to E, and you’ll 
hear everything that Chopin had to say.” 

This time when he plays the piece, the beauty of 
longing and desire that was built into the piece 
manifests itself in the hearts of the audience. They 
can feel themselves in the music. People in the 
audience fall in love with classical music when they 
can understand it emotionally. 


Call to Action 

Zander concludes by sharing a life-changing 
realization that his job is to awaken the 
possibility in others. “And do you know how to 
find out [if you succeeded]? You look at their 
eyes. If their eyes are shining, you know you’re 
doing it.” He challenges the audience to ask the 
question of themselves: “Who are we being as we 
go back into the world? It’s not about wealth and 
fame and power. It’s about how many shining 
eyes are around you.” 
Engage by Singing 

Though not shown in the video, Zander comes 
back for an encore in which he leads the “TED 
choir” in a rousing rendition of Beethoven’s Ode 
to Joy in German. 
Stories have been told for thousands of years in order to transfer cultural 
lore and values. When a great story is told, we lean forward, and our 
hearts race as the story unfolds. Can that same power be leveraged for 
a presentation? Yes. 

The timeless structure of a story can contain information that persuades, 
entertains, and informs. Story serves as a perfect device to help an audience 
recall the main point and be moved to action. Once a presentation is 
put into a story form, it has structure, creates an imbalance the audience 
wants to see resolved, and identifies a clear gap that the audience can fill 
How Do You Resonate with These Folks? 


The instruction your high school speech teacher gave 
you to picture the audience in their underwear is now 
officially obsolete. Instead, you need to picture them 
all in colorful stockings and tunics with superhero 
emblems—because these are the heroes charged with 
carrying your big idea to fruition. 

It’s important to know what makes your audience tick 
in order to connect with them. So how do you get to 
know them and really understand what their lives are 
like? What makes them laugh? What makes them cry? 
What unites them? What incites them? What is it that 
makes them deserve to win in life? It’s important to 
figure this out because according to the former AT&T 
presentation research manager, Ken Haemer, “designing 
a presentation without an audience in mind is like 
writing a love letter and addressing it ‘to whom it may 
concern.’” 1 This section will help you create empathy 
for your audience by brainstorming the attributes of 
the hero and mentor archetypes. 

Though your heroes might be lumped together in a 
room, you shouldn’t view them as a homogeneous 
blob. Instead of thinking about the audience as a unified 
clump when preparing your presentation, imagine 
them as a line of individuals waiting to have face-to-face 
conversations with you. You want to make each 
person feel like you’re having a personal exchange with 
him or her; it will help you speak in a conversational 
tone, which will keep them interested. People don’t fall 
asleep during conversations (unless your conversations 
are boring too; if so, you need help beyond what this 
book provides). 

An audience is a temporary assembly of individuals 
who, for an hour or so, share one thing in common: your 
presentation. They are all listening to the same message 
at the same moment; yet all of them are filtering it differently 
and gleaning their own unique insights, points 
of emphasis, and meaning. If you find common ground 
from which to communicate, their filter will more readily 
accept your perspective. 

As an option, you might want to create a narrowly 
targeted message for specific people in the audience 
so that your presentation comes across as a personal 
conversation with the highest-priority individuals. Even 
if only one person gets it—if it’s the right person—it’s 
worth it! 

You need to get to know these folks. You are their mentor. 
Each one has unique skills, vulnerabilities, and even 
a nemesis or two. The audience must be your focus 
while you create the content of your presentation. They 
are so important, in fact, that the next two sections 
of the book will revolve around the audience. So stop 
thinking about yourself and start thinking about connecting 
with them. 
Segment the Audience 


One way to get to know your audience is through a process called segmentation. 
By partitioning a large audience into smaller subsegments, you can 
target the segment that will bring the most additional supporters. Determine 
which group is most likely to adopt your perspective—the group with which 
you can make the greatest impact with the least effort. It’s tricky to appeal 
to the broader audience and simultaneously connect deeply with the subset 
that will play a key role in helping you—but it’s worth the effort. 

The most commonly used segmentation method is to segment by demographics. 
Most conference organizers can provide only limited information 
about the audience: where they work, their title, geographical location, and 
company. You can make some assumptions from this information, but it’s 
limited to just that—assumptions. 

When I presented to top executives from a national beer manufacturer, I 
needed to spend time thinking about how to connect with them, because 
based on demographics alone, we did not have much in common in this arena. 
I’m a middle-aged female who drinks fruity cocktails because I imagine beer 
might taste like fizzy pee. That’s a pretty big gap. 

I didn’t receive enough information from the event organizers to feel like I 
really knew what was important to them. 



Gender 
Beer executives: 34 males, 14 females 
Nancy Duarte: Female 

Job Title 
Beer executives: Executives with titles like director, vice president, and CMO 
Nancy Duarte: Entrepreneur and CEO 

Geography 
Beer executives: They flew in from 11 countries 
Nancy Duarte: I drove 3.6 miles up the road 
Collecting their gender and country of origin isn’t enough 
information to communicate with them meaningfully. 
Audiences aren’t moved solely because they are old or 
young, from Kansas or from California. Their demographics 
are only part of the story. 

Truly communicating effectively takes research. That can 
mean sending out your own survey that will help you gain 
insights or, if you’re targeting a broader industry group, 
going online and finding popular blogs by industry icons 
to see what’s on their minds. You might take note of what 
they chat about on social media sites until you reach a 
point where you feel you know them personally. 

Don’t segment the audience in a clichéd or generalized 
way. Defining your audience too broadly can make you 
seem impersonal or unprepared. It can cause your audience 
to feel like a statistic, or like they are being narrowly 
stereotyped, which can be offensive. The main idea is 
that you need to define the audience in a way that’s 
accurate and appropriate for the kind of presentation 
you will deliver. 

Several things helped me prepare for the presentation to 
the beer executives. I bought subscriptions to a couple 
of key marketing publications to see what was being said 
about their brands, solicited feedback from my social 
network, searched for articles about them, reviewed the 
conversations in the top beer blogs, found their own 
presentations on the Web, read their press releases, and 
read their company’s latest annual report. 

The research helped me understand their challenges. 
Even though I only used a portion of the insights in the 
actual presentation, I felt like I knew them and had empathy 
for what was on their minds. Those insights helped 
me feel connected to them. 
Case Study: Ronald Reagan 

Space Shuttle Challenger Address 


President Ronald Reagan was a skilled communicator 
who was faced with a daunting communication situation 
immediately after the Space Shuttle Challenger disaster. 

The shuttle's launch had already been delayed twice, 
and the White House was insisting that it launch before 
the State of the Union address, so it took off on January 
28, 1986. This particular launch was widely publicized 
because for the first time a civilian—a teacher named 
Christa McAuliffe—was traveling into space. The plan 
was to have McAuliffe communicate to students from 
space. According to the New York Times, nearly half of 
America's schoolchildren aged nine to thirteen watched 
the event live in their classrooms. 2 Just seventy-three 
seconds into flight, the world was stunned when 
the shuttle burst into flames, killing all seven crew members 
on board. 

President Ronald Reagan canceled his scheduled State 
of the Union address that evening and instead addressed 
the nation’s grief. In Great Speeches for Better Speaking, 
author Michael E. Eidenmuller describes the situation: “In 
addressing the American people on an event of national 
scope, Reagan would play the role of national eulogist. 
In that role, he would need to imbue the event with 
life-affirming meaning, praise the deceased, and manage 
a gamut of emotions accompanying this unforeseen 
and yet unaccounted-for disaster. As national eulogist, 
Reagan would have to offer redemptive hope to his audiences, 
and particularly to those most directly affected 
by the disaster. But Reagan would have to be more than 
just a eulogist. He would also have to be a U.S. president 
and carry it all with due presidential dignity befitting the 
office as well as the subject matter.” 3 


President Reagan’s ability to credibly move in and out 
of different roles for different audience segments was a 
large part of what made him The Great Communicator. 

The speech succeeded in meeting the emotional requirements 
of its various audiences by carefully addressing 
each segment. The circumstances gave a natural situational 
segmentation; it would not have been appropriate 
for him to address them based on conventional distinctions 
of gender or political parties. 


Audience Segmentation

Collective Mourners
Families of the Fallen
Schoolchildren
Soviet Union
NASA

Reagan took care to connect all subaudiences to the larger
audience of collective mourners. He brought disparate
groups together by treating them as a single organic whole:
a nation of people called to a place of national sorrow and
remembrance. Eidenmuller says, “Catastrophic events do
provide the basis for rhetorical situations. Despair, anxiety,
fear, anger, and the loss of meaning and purpose are powerful
psycho-spiritual forces that deeply affect us all. It has
been said that ‘without hope the people perish.’ And without
hearing powerful and timely words of encouragement,
the people may never find cause for hope.” 4

The speech lasted only four short minutes. The pages that 
follow show how carefully and beautifully President Reagan 
addressed the various audiences that evening. 
Many insights from this analysis are from Michael E. Eidenmuller’s book Great Speeches for Better Speaking.
The text in italics denotes direct quotes from his work. 5


Speech 


Analysis 


Ladies and gentlemen, I’d planned to speak to you tonight to report on 
the State of the Union, but the events of earlier today have led me to 
change those plans. Today is a day for mourning and remembering. Nancy 
and I are pained to the core by the tragedy of the Shuttle Challenger. We 
know we share this pain with all of the people of our country. This is truly 
a national loss. 


The State of the Union address is an 
annual, constitutionally sanctioned speech 
delivered like a national progress report— 
and is a significant task to reschedule. 
Reagan positions himself both outside 
the fray as one presiding over it and as one 
inside of it who shares its painful reality. 


Nineteen years ago, almost to the day, we lost three astronauts in a terrible 
accident on the ground. But we’ve never lost an astronaut in flight. We’ve 
never had a tragedy like this. And perhaps we’ve forgotten the courage it 
took for the crew of the shuttle. But they, the Challenger Seven, were aware 
of the dangers, but overcame them and did their jobs brilliantly. We mourn 
seven heroes: Michael Smith, Dick Scobee, Judith Resnik, Ronald McNair, 
Ellison Onizuka, Gregory Jarvis, and Christa McAuliffe. We mourn their loss 
as a nation together. 


Reagan positions the tragedy within a 
larger picture without losing the significance
of the present tragedy. He names
each crew member and praises them for 
their courage. To further manage our 
emotions, Reagan again calls us to national 
mourning, and establishes the primary 
audience as the collective mourners. 


For the families of the seven, we cannot bear, as you do, the full impact 
of this tragedy. But we feel the loss, and we’re thinking about you so very 
much. Your loved ones were daring and brave, and they had that special 
grace, that special spirit that says, “Give me a challenge, and I’ll meet 
it with joy.” They had a hunger to explore the universe and discover its 
truths. They wished to serve, and they did. They served all of us. 


Reagan narrows his focus to the first and 
most affected subaudience: the families 
of the fallen. He acknowledges the 
inappropriateness of suggesting how they 
should feel and offers praise they can take 
hold of with words like “daring,” “brave,”
“special grace,” and “special spirit.” 


We’ve grown used to wonders in this century. It’s hard to dazzle us. But 
for twenty-five years the United States space program has been doing 
just that. We’ve grown used to the idea of space, and perhaps we forget 
that we’ve only just begun. We’re still pioneers. They, the members of the 
Challenger crew, were pioneers. 


Reagan then draws attention back to the 
general audience’s interest in the larger 
scientific story. He then envisions the 
crew’s place in history as transcending 
science altogether by calling them 
pioneers. The term “pioneer” cloaks them
in a mythical covering, one dating back 
to our nation’s earliest ventures. 


And I want to say something to the schoolchildren of America who were 
watching the live coverage of the shuttle's takeoff. I know it’s hard to 
understand, but sometimes painful things like this happen. It's all part of 
the process of exploration and discovery. It's all part of taking a chance and 
expanding man’s horizons. The future doesn’t belong to the fainthearted; 
it belongs to the brave. The Challenger crew was pulling us into the future, 
and we’ll continue to follow them. 


Reagan’s next subaudience is the schoolchildren—an
estimated five million—among whom
are the students of Christa McAuliffe’s class and
school. Reagan momentarily adopts the tone of
an empathizing parent, which is tough to do
while remaining “presidential,” but Reagan
carries it well.


I’ve always had great faith in and respect for our space program. And what 
happened today does nothing to diminish it. We don’t hide our space program.
We don’t keep secrets and cover things up. We do it all up front and
in public. That’s the way freedom is, and we wouldn’t change it for a minute. 

We’ll continue our quest in space. There will be more shuttle flights and 
more shuttle crews and, yes, more volunteers, more civilians, more teachers
in space. Nothing ends here; our hopes and our journeys continue.


Here, Reagan the national eulogist hands off to 
Reagan the U.S. president. This passage contains 
the only political statement in the address and is 
targeted at the Soviet Union. He attacks the secrecy 
surrounding their failures, which had irked American 
scientists who knew that shared knowledge was 
the best way to ensure the stability and safety of 
space programs. 


I want to add that I wish I could talk to every man and woman who works 
for NASA or who worked on this mission and tell them, “Your dedication 
and professionalism have moved and impressed us for decades. And we 
know of your anguish. We share it.” 


In this direct address to NASA, Reagan gives 
needed encouragement, and then turns back 
again to connect to the whole audience by 
saying, “We share it.” 


There’s a coincidence today. On this day 390 years ago, the great explorer 
Sir Francis Drake died aboard ship off the coast of Panama. In his lifetime the 
great frontiers were the oceans, and a historian later said, “He lived by the 
sea, died on it, and was buried in it.” Well, today we can say of the Challenger 
crew, “Their dedication was, like Drake’s, complete.” 

The crew of the Space Shuttle Challenger honored us by the manner in which 
they lived their lives. We will never forget them, nor the last time we saw 
them, this morning, as they prepared for their journey and waved goodbye 
and “slipped the surly bonds of earth” to “touch the face of God.” Thank you. 


In closing, Reagan creates an eloquent and poetic 
moment. It captures the mythological sentiment 
surrounding humanity’s unending quest to solve 
the mysteries of the unknown. The phrase “touch 
the face of God” was taken from a poem entitled
“High Flight,” written by John Magee, an American
aviator in WWII. Magee was inspired to write the 
poem while climbing to thirty-three thousand 
feet in his Spitfire. It remains in the Library of 
Congress today. 


Get to Know the Hero 63 



Meet the Hero 


It helps to split an audience into segments—but humans
are more complex than that. In order to connect personally,
you have to bond with what makes people human.
Take time to analyze their lives, and valuable insights
will appear. After all, it’s tough to influence people
you don’t know.

At the beginning of a movie, the hero’s likability is established.
The same applies to a presentation. Successful
Hollywood screenwriter Blake Snyder coined the phrase
“save the cat” to describe a hero’s likability. Snyder says
that a “save the cat” scene is “where we meet the hero
and he or she does something—like saving a cat—that
defines who he is and makes the audience like him.” 6
By answering the questions on the right, you’ll uncover
what makes your hero likable.

Liking your audience members is the first step in being 
genuine with them. Study them. What would a walk in 
their shoes be like? What keeps them up at night? What 
are they called to do that will make a difference on this 
earth? Imagine their lives by the day, hour, and minute. 

Remember, because they are human, their lives are 
messy. They might have a sick child at home, might not 
have slept well on the hotel pillow, might not be making
ends meet financially, or just might not feel on top of 
their game. Look for insights into how your idea will 
alleviate pressure on them if they take action. 

It’s easy to focus on what they do for their career; these 
questions help you think about who they are. There’s a 
difference. Knowing their titles isn’t enough. Let’s say 
you’ll be speaking at a Human Resources event and the 
bulk of attendees are directors of human resources. Go 
online and figure out how much money they make. Is 
it enough to get by based on where they live? How do 
you imagine they would spend their paychecks? What 
are the typical temperaments of people in their role? 
Are they spontaneous or methodical? 

Keep answering the questions until you move away 
from what your audience members do for a job and 
begin to acquaint yourself with who they are as people. 
You can imagine their childhood. What games did they 
play? What was home life like? What TV shows shaped 
their psyche? Anything that will generate a connection. 

Your goal is to figure out what your audience cares 
about and link it to your idea. 


Who They Are 


LIFESTYLE 

What's likable and special 
about them? What does 
a walk in their shoes 
look like? Where do they 
hang out (in life and on 
the Web)? What's their 
lifestyle like? 


VALUES 

What’s important to 
them? How do they 
spend their time and 
money? What are their 
priorities? What unites 
them or incites them? 


KNOWLEDGE 

What do they already 
know about your topic? 
What sources do they 
get their knowledge 
from? What biases do 
they have (good or bad)? 


INFLUENCE 
Who or what influences
their behavior?
What experiences 
have influenced their 
thoughts? How do they 
make decisions? 


MOTIVATION AND DESIRE 

What do they need or 
desire? What is lacking 
in their lives? What gets 
them out of bed and 
turns their crank? 


RESPECT 

How do they give and 
receive respect? What 
can you do to make 
them feel respected? 
Meet the Mentor


Now that you’ve spent time getting into the audience's 
hearts and heads, it’s time to look at your role as 
mentor. But wait—weren’t you told earlier not to think 
about yourself? So which is it? It does seem like a 
contradiction, but mentors are selfless and think of 
themselves in the context of others. These exercises 
will help you think about yourself in terms of what you 
can give the audience. 

Your role as mentor is to influence the hero (audience) 
at critical junctures of his life. The mentor’s appearance 
in the journey is essential to moving the hero past the 
blockades of doubt and fear. Mentors usually have two 
major responsibilities: teaching and gift-giving. 

In the movie The Karate Kid (1984), Mr. Miyagi not only 
teaches protege Daniel the “tool” of karate; he gives 
him insights into the meaning of life: 


Miyagi: What matter?
Daniel: I'm just scared. The tournament and everything.
Miyagi: You remember lesson about balance?
Daniel: Yeah.
Miyagi: Lesson not just karate only. Lesson for whole
life. Whole life have a balance. Everything be
better. Understand?

What insights into life can you give the audience? Draw
on your own deep truths and transfer to your audience 
a sense for what it would be like for them to walk fully in 
their calling. 

Stay mindful of how you fit into their lives. You are making
a small appearance in your hero’s grand life story to
help get him unstuck and provide him with the resources
to help him on his journey. Yes, you have important information
to convey—maybe even a deal to close—but your
presentation should offer something valuable too.

The mentor should provide the hero with important, 
useful, previously unknown information. You should also 
motivate the hero when she is fearful or hesitant and 
give the hero tools for her tool belt. These tools could be 
roadmaps for success, new communication techniques, 
or even insights into her soul. No matter what the tool is, 
the audience should leave each presentation knowing 
something they didn’t know before and with the ability 
to apply that knowledge to help them succeed. 

You mustn’t come across as if the audience is helping
you on your journey. You’re to be a gift to them. Every
once in a while, mentors gain something from the relationship
for themselves, like knowledge or insights—but
that shouldn’t be your goal. An audience can always
spot selfish motives.


Miyagi was a pretty smart dude. He got 
his deck sanded, car washed, plus his 
fence and house painted out of the deal. 
At times, there’s benefit to the mentor, 
but the greater benefit should always 
be for the hero. 


What You 
Give Them 


GUIDANCE 

What insights and 
knowledge will help them 
navigate their journey? 


CONFIDENCE 

How can you bolster 
their confidence so 
they aren't reluctant? 


TOOLS 

What tools, skills, or magical 
gifts do they gain from you 
on their journey? 
Create Common Ground


Creating common ground with an audience is like clearing
a pathway from their heart to yours. By identifying
and articulating shared experiences and goals, you
build a path of trust so strong that they feel safe crossing
to your side. You develop credibility without coming
across as arrogant. Even your magnificent qualifications
should be revealed in a humble and selfless way that
connects with them.

Sharing keen insights and a magical tool or two is great,
but if you aren't credible, your audience won't listen. As
you present, they’re sizing you up: Is she articulate? Is
she qualified? Do I like her? It’s human nature for people
to compare and validate others against their own criteria
and experiences before adopting a new perspective.

Focusing on commonalities bolsters credibility, so
spend time uncovering similarities. Seek out shared
experiences and goals that you can bring to the foreground.
A presentation that creates common ground
has the potential to unite a diverse group of people
toward a common purpose—people who normally
might never have unified because of their great diversity.
People set aside differences when they’re strongly
connected to achieving a common goal.

If a presentation goes badly awry, it’s easy to blame
the audience for misinterpretations and say, “That’s
not what I meant. How could they be so dumb?” In the
blame game, all ten fingers should be pointed at you,
not the people “misinterpreting” your presentation. You
chose the words and images to convey your idea; if it
didn’t align with the audience’s experiences, you need to
own up to the misunderstanding.

I had one of those “why doesn’t the audience get this
obvious idea” moments when conveying our company
vision in 2007. My employees are not blind; my communication
was flawed. Having been through three significant
economic downturns, I could easily see the next one
coming a mile away. I knew that the firm needed to make
some immediate changes that would help us weather the
storm. But to the team, everything seemed safe and
stable. So when I delivered an urgent “danger is imminent”
message, it backfired. At the end of my dramatic presentation,
my employees sat stunned, feeling like I was trying
to manipulate them by telling them the sky was falling.
What I thought was a presentation dripping with insight
and urgency, my young staff—who had only known
prosperity and stability—perceived as manipulative. My
message and means of communication slowed progress
to a crawl. A handful understood, but getting everyone
on board proved almost insurmountable. It took an entire
year to reframe the issues and build momentum. Even
though a downturn was coming, the idea had no traction
because I didn’t use symbols or experiences to which my
audience could connect.

An audience chooses whether to connect to you or not.
People will usually respond only if it’s in their best interest.
Personal values will ultimately drive their behavior, so
ideally you should identify and align with existing values.


How You 
Connect 
with Them 


SHARED EXPERIENCES 

What from your past 
do you have in common: 
memories, historical 
events, interests? 


COMMON GOALS 

Where are you headed 
in the future? What 
types of outcomes 
are mutually desired? 


QUALIFICATIONS 

Why are you uniquely 
qualified to be their guide? 
What similar journey 
have you gone on with 
a positive outcome? 
Communicate from the Overlap


Why do you have to go through all these questions about the audience 
and yourself? Connecting empathetically with an audience requires developing
understanding and sensitivity to their feelings and thoughts.

People come to a presentation with their own facts and emotions stored 
neatly in their heads and hearts. People are wired to absorb information 
and transform it into personal meaning that shapes their perspectives. 

It's the presenter's job to know and tune into the audience’s frequency. 
Your message should resonate with what’s already inside them. As a 
presenter, if you send a message that is tuned to the “frequency” of 
their needs and desires—they will change. They might even quiver with 
enthusiasm and move together to create beautiful results (page 4). 

When you know someone well, your common experiences create shared 
meaning. My husband, Mark, can say just one word that is packed with 
enough meaning, and I’m howling with laughter on the ground. Granted, 
you probably haven’t been married to your audience for thirty years—but 
if you do your homework, they will feel like a good friend. And friends 
know how to persuade one another. They have a natural way of swaying 
each other toward their perspective. 

Establishing how you’re alike also clarifies how you’re different. Once 
you’ve identified the overlap, you’ll have a clearer understanding of 
what’s outside the overlap that needs to be embraced by the audience. 

Your objective is to find the most relevant and believable way to link 
your issue to your audience’s top values and concerns. 
[Figure legend: Fact, Emotion, Overlap]


“If any man were to ask me what I would suppose to be the perfect style 
of language, I would answer, that in which a man speaking to five hundred 
people, of all common and various capacities, idiots or lunatics excepted, 
should be understood by them all, and in the same sense which the speaker 
intended to be understood.” 


Daniel Defoe 7 




When you know someone, really know them, it's easy to persuade them. 
Investing time into familiarizing yourself with the audience solidifies your 
ability to persuade. 

Meet the Hero: The audience is the hero who will determine the outcome 
of your idea, so it’s important to know them fully. Jump into the shoes of 
your audience and look carefully at their lives. Picture them as individuals 
with complex lives. Identify with their feelings, thoughts, and attitudes. 
Discover their lifestyles, knowledge, desires, and values. Painting a picture 
of who they are in their ordinary world helps you connect with them and 
communicate from a place of empathy. 

Meet the Mentor: Embracing the stance of mentor clothes you in humility. 
It moves you from forcing information on “an ignorant audience” to giving 
them valuable tools to guide them on their journey or help them get unstuck. 
They should leave with valuable insights they didn’t have before they met 
with you. 

When an audience gathers, they have given you their time, which is a precious
slice of their lives. It’s your job to have them feel that the time they
spent with you brought value to their lives. 


72 Resonate 




















Preparing for the Audience’s Journey 


Presentations should have a destination. If you don't 
map out where you want the audience to be when they 
leave your presentation, the audience won’t get there. 

If a sailor wanted to travel to Hawaii, he wouldn’t hop in 
the boat, open the sails, guess at a direction, and fully 
expect to arrive after a few days of sailing. It simply 
doesn’t work that way. You have to set a course, and 
that means developing the right content. The destination
you define can serve as a guide. Every bit of
content you share should propel the audience toward 
that destination. 

Keep in mind that a presentation is designed to transport
the audience from one location to another. They
will feel a sense of loss as they move away from their
familiar world and closer to your perspective. You are
persuading the audience to let go of old beliefs or habits
and adopt new ones. When people deeply understand
things from a new perspective to the point where they
feel inclined to change, that change begins on the inside
(heart and mind) and ends on the outside (actions and
behavior). However, this typically doesn’t happen without
a struggle.

That struggle usually manifests as resistance—something
that can be harnessed if you plan for it. When a sailboat
is sailing against the wind, the sails are positioned to harness
the wind. If done well, the boat sails faster than the
wind itself—even though the gusts are opposing it. While
you might not be able to control the severity of audience
resistance, you can “adjust your sails” (message) and use
it to gain momentum. When harnessed properly, the seemingly
counterproductive force creates forward progress.
As in sailing, however, your message needs to move back
and forth to get there (just like the presentation form).

The journey should be mapped out, and all related messages
should propel the audience closer to the destination.
The Big Idea


A big idea is that one key message you want to communicate.
It contains the impetus that compels the audience
to set a new course with a new compass heading.
Screenwriters call this the “controlling idea.” It has also
been called the gist, the take-away, the thesis statement,
or the single unifying message.

There are three components of a big idea: 

ONE A big idea must articulate your unique point of
view. People came to hear you speak; since they want
to know your perspective on the subject, you should
give it to them. For example, “the fate of the oceans”
is merely a topic; it’s not a big idea. “Worldwide pollution
is killing the ocean and us” is a big idea that has a
unique point of view. The big idea doesn’t have to be
so unusual that no one has ever heard of it before. It
just needs to be your point of view on the subject
rather than a generalization.

TWO A big idea must convey what’s at stake. The
big idea should articulate the reason why the audience
should care enough to adopt your perspective. You
could say your idea is to “replenish the wetlands through
new legislation.” But compare that to “Without better
legislation, the destruction of the wetlands will cost the
Florida economy $70 billion by 2025.” Conveying what’s
at stake helps the audience recognize the need to participate
and become heroes. Without a compelling reason
to move, a big idea falls flat.

THREE A big idea must be a complete sentence. Stating
the big idea in sentence form forces it to have a noun
and a verb. When asked the question “What’s your presentation
about?” most people respond with something
like “It’s the third-quarter update” or “It’s about new
software.” These are not big ideas. A big idea has to be
a complete sentence: “This software will make your team
more productive and generate a million dollars in revenue
over two years.” It’s even better if the word “you” is used
in the sentence; that ensures that it’s written to someone.

Emotion is another important component to the big idea. 
Boiling down all of the various emotions simplifies this 
task. Ultimately, there are only two emotions—pleasure 
and pain. A truly persuasive presentation plays on those 
emotions to do one of the following: 

• Raise the likelihood of pain and lower the likelihood of 
pleasure if they reject the big idea. 

• Raise the likelihood of pleasure and lower the likelihood 
of pain if they accept the big idea. 1 

For example, a business presentation that centers on
“We are losing our competitive advantage” as its big
idea has nothing at stake. In contrast, the message “If we
don’t regain our competitive advantage, your jobs are in
jeopardy” makes it clear that there’s plenty at stake! It
appeals to employees’ human instinct to survive. Humans
change when there is a threat and sense of urgency. In
the January 2007 issue of Harvard Business Review, John
P. Kotter explained that “most successful change efforts
begin when some individuals or groups start to look hard
at a company’s competitive situation, market position,
technological trends, and financial performance. They
then find ways to communicate this information broadly
and dramatically, especially with respect to crises, potential
crises, or great opportunities that are very timely.” 2

The gravity of the presentation should match the severity 
of the situation and accurately reflect what’s at stake—no 
more, no less. 


A Big Idea 


THESE ARE NOT BIG IDEAS 


Lunar Mission 


Client Sales Call 


Third-Quarter Update 


YOUR UNIQUE 
POINT OF VIEW 
ON A TOPIC 


A CLEAR STATEMENT 
OF WHAT’S AT STAKE 
FOR THOSE WHO DO 
OR DON’T ADOPT YOUR 
POINT OF VIEW 


WRITTEN IN 
THE FORM OF 
A SENTENCE 


THESE ARE BIG IDEAS 


The United States should lead 
in space achievement because 
it holds the key to our future 
on Earth. 


JFK knew that no one could predict the 
outcome of the space race, but he believed 
it would determine who wins the battle 
between freedom and tyranny. 


Our software gives your customers
access to their records,
which saves your employees
time and increases your margins
by 2 percent.

Third-quarter numbers are 
down; and to stay in the game, 
every department needs to 
support the sales initiative. 
Plan the Audience’s Journey


Now that you've established the big idea and defined
the destination, it’s time to map out the journey.
Remember, persuasion requires that you ask the audience
to change in some way; and most change compels
people to move from one way of being or doing
to a new way of being or doing. Many times
there’s an internal, emotional change that must occur
before they show signs of external change through
their behavior.

Change is interesting to watch. We go to the movies 
or read a book to see the change that happens in the 
main character. This carefully planned change is called 
the character arc—the identifiable internal and external
change that the hero endures. 

When a screenplay is submitted for acquisition to a
studio, a story analyst evaluates it by assessing the
quality of the character arc. The story analyst determines
the quality fairly quickly simply by looking at the
first and last pages of the script. The first page sets up
who the hero is when the movie begins, and the last
page determines how much the hero changed during
its course. This quick assessment of a screenplay
determines whether the hero’s journey changed her at
all. If the hero didn’t change enough by the last page,
it’ll be a boring film. 3 Great stories show growth and
transformation in the characters.


In the same way a story analyst looks at the first and last
page of a screenplay, you must envision and study your
audience at the beginning of your presentation—and
who you want them to be when they leave. Upon entering
the room, your audience holds a point of view about
your topic that you want to change. You want to move
them from inaction to action; you want them to leave the
room holding your perspective as dear and committing
to it. This won’t happen without a carefully planned map.

To plan an audience journey, identify both where you want to
move the audience from and where you want to move them to. Identify both
their inward and outward transformation. If you change 
them on the inside, you can usually observe that through 
their actions. This outward change is the proof that they 
understood and believed the big idea. Changing beliefs 
changes actions. 

You might be thinking, “Gosh, I’m just presenting at my 
staff meeting; I can skip this step.” Perhaps a better
option, in that case, would be writing and distributing 
a report. Although, if your staff meeting is about the 
status of a project that is over budget, you better get in 
there and move them from thinking being over budget is 
okay, to taking responsibility and working hard to ensure 
the budget gets back on track. This, then, is a persuasive 
situation that requires a clearly defined journey. 
Audience Journey

MOVE FROM one manner of being (inward change)
MOVE TO a new manner of being

MOVE FROM one manner of doing (outward change)
MOVE TO a new manner of doing

AUDIENCE JOURNEY FOR JFK’S LUNAR SPEECH TO CONGRESS IN 1961

MOVE FROM: Feeling the plan is too risky and impossible within the ten-year time constraint.
MOVE TO: A sense of urgency because the Soviets have a head start and could remain in the lead.

MOVE FROM: Approving only a portion of the budget.
MOVE TO: Approving entire $7 to $9 billion additional budget over next five years.
Tools for Mapping a Journey


These pages contain a couple of tools that can help trigger ideas as you map
out the audience journey. 

Below is a list of words culled from various articles on change management. 
It's not an exhaustive list of every type of change, but it can help spark ideas 
for how you want your audience to be transformed. 


MOVE FROM     MOVE TO       MOVE FROM      MOVE TO       MOVE FROM      MOVE TO

Abstain       Try           Disengage      Engage        Misunderstand  Understand
Accuse        Defend        Dislike        Like          Naysayer       Advocate
Apathy        Interest      Disregard      Examine       Nemesis        Ally
Aware         Buy           Dissuade       Persuade      Obligated      Passionate
Cancel        Implement     Divide         Unite         Passivist      Activist
Chaos         Structure     Doubt          Believe       Pessimistic    Optimistic
Close-minded  Open-minded   Exclude        Include       Reject         Accept
Complicate    Simplify      Exhaust        Invigorate    Resist         Yield
Conceal       Familiarize   Forget         Remember      Retreat        Pursue
Confused      Clear         Hesitant       Willing       Risky          Secure
Control       Empower       Hinder         Facilitate    Sabotage       Promote
Deconstruct   Establish     Ignorant       Learn         Skeptical      Hopeful
Delay         Do            Ignore         Respond       Standardize    Differentiate
Despise       Desire        Impotence      Influence     Stay Put       Begin
Destroy       Create        Improvise      Plan          Think          Know
Disagree      Agree         Individual     Collaborator  Unclear        Clear
Disapprove    Recommend     Invalidate     Validate      Uncomfortable  Comfortable
Disband       Assemble      Irresponsible  Responsible   Undermine      Support
Discontent    Content       Keep Quiet     Report
Discourage    Encourage     Maintain       Change











Determine Where They Need to Move to in a Process 

Many presentations occur to move an audience from being stuck on a 
project to being unstuck. Projects and processes reach critical junctures 
where the team needs encouragement and prodding or the project could 
miss the deadline or stagnate. Another way to determine the journey is 
to assess the process and determine what phase they should be in (or 
are stuck in) and prepare your messages to move them from the current 
phase to the next phase. For example, you might want to move a client 
along to the next step in the sales cycle, which means you might need to 
move them from interest to evaluating your product. The graphic to the 
right lists common processes. The graphic below shows a master idea life 
cycle that you can use to get your idea unstuck. 


Process Segmentation 

Determine the phase in the process the 

audience needs to move to: 

• By Project Process: Analyze, design, 
develop, implement, evaluate 

• By Sales Cycle Process: Awareness, 
interest, desire, evaluation, action, loyalty 

• By Adoption Process: Innovators, early 
adopters, majority, laggards 


IDEA LIFE CYCLE 
Acknowledge the Risk 


People have an innate sense of fear when embarking on 
a journey with an unknown outcome. This unknown element 
is what makes change so frightening. 

Change involves the addition of the new and the abandonment 
of the old. In order for new societies to rise, old 
societies must therefore fall. New technology emerges 
while old technology obsolesces. Even in persuasion, to 
accept something new means sacrificing something else. 

Sacrifice is defined as the surrender or destruction of 
something prized or desirable for the sake of something 
considered as having a higher or more pressing claim. 
Often, our audience can't change unless a sacrifice is 
made. A trade-off. Letting go. 

To adopt your perspective, the audience has to, at a 
minimum, abandon what they previously held as true. 

Changing their minds is like asking them to forsake 
an old friend who has stood by them for a long time. 
Losing an old friend hurts. 
Even something seemingly trivial—like a forfeit of their 
time—might require them to risk something. Working late 
might mean missing volleyball practice or the chance 
to tuck their kids into bed at night. Be cognizant of the 
sacrifice the audience will make when you ask them to 
do something, because you’re asking them to give up 
a small—but still irretrievable—slice of their lives. If you 
consider the potential risks that the audience will face 
when you ask them to buy into your big idea, you will be 
prepared to manage their apprehension and respond 
effectively to overcome it. 

The source of audience resistance is usually related to 
the sacrifice they know will be required of them. Parting 
with their time or money is a loss to them. Your presentation 
is a disruption to their contented stance. You’re 
saying they need to buy your product, be more productive, 
or join a movement, but they think they are fine 
right where they are. 

Change requires a breaking down before there’s a building 
up, and this is where the audience needs the encouragement 
from the mentor most of all. 



Audience transformation is guided along a 
grand plan similar to the metamorphosis of 
a butterfly. After the caterpillar creates a 
hard, protective cocoon, what happens on 
the inside is almost tumultuous. The solids 
of the caterpillar liquefy and regroup into 
a completely different form. A butterfly. 


Empathize with 
Their Sacrifice 
and Risk 


SACRIFICE 

What would they sacrifice 
to adopt your idea? What 
beliefs or ideals will be let 
go? How much will it cost 
them in time or money? 


RISK 

What’s the perceived 
risk? Are there physical 
or emotional risks they 
will need to take? How will 
this stretch them? Who 
or what might they have 
to confront? 
Address Resistance 

Refusal of the Call 


There's no doubt about it; most people do not enjoy 
change and will resist. An audience might understand 
your plea, and even mentally accept it, but they still 
might not be moved to action. 

In the July 2008 issue of Harvard Business Review, John 
P. Kotter and Leonard A. Schlesinger reported, “All people 
who are affected by change experience some emotional 
turmoil. Even changes that appear to be ‘positive’ 
or ‘rational’ involve loss and uncertainty. Nevertheless, 
for a number of different reasons, individuals or groups 
can react very differently to change—from passively 
resisting it, to aggressively trying to undermine it, to 
sincerely embracing it.” 4 

Audience members will often push back or try to find 
errors in your presentation because if they don’t, they 
have to either live with the contradiction between their 
old position and the new one you’ve “sold” them, or 
opt to change. Their resistance could be as subtle as 
skepticism or as destructive as a revolt, and you must 
deal with it squarely. How do you modify your communication 
to move the audience from aggressively trying to 
undermine your message to sincerely embracing it? 

Carefully contemplate all the ways in which your audience 
might resist. What attitudes, fears, and limitations 
do they use as a tool to oppose implementing the idea? 
After identifying their reasons for refusal, use those 
concerns as inoculants. State the opposing points before 
they get a chance to refute your point. 

An inoculation purposefully infects a person to minimize 
the severity of an infection. The same takes place when 
you empathetically address an audience’s refusals by stating 
them openly in your talk. This will help them see that 
you’ve thought through everything—which will decrease 
their anxiety. 

Most people don’t resist simply for the sake of resistance 
(although some do). Most resist because you’re 
asking them to do something that requires them to 
take a risk or make a sacrifice of a varying degree. For 
example, asking people to buy a product could make 
them feel as if they are risking their reputations by 
spending company money on a product with an unpredictable 
outcome. 

What you perceive as resistance may be viewed completely 
differently in the audience members’ minds. 
They might resist your message because, from their 
perspective, it puts their reputations, credibility, or 
honor on the line. If the audience takes this stance with 
your message, what you see as resistance they see as 
valor. They are protecting the things they hold dear and 
responding appropriately. Acknowledge their resistance 
while simultaneously assuring them that they are in 
good hands with you, their mentor. 


Refusal of the Call 


COMFORT ZONE 

What’s their tolerance 
level for change? Where is 
their comfort zone? How 
far out of it are you asking 
them to go? 


MISUNDERSTANDING 

What might they misunderstand 
about the message, 
the proposed change, or 
the implications? Why 
might they believe the 
change doesn’t make 
sense for them or their 
organization? 


FEAR 

What keeps them up at 
night? What’s their greatest 
fear? What fears are 
valid, and which should 
be dispelled? 


OBSTACLES 

What mental or practical 
barriers are in their way? 
What obstacles cause friction? 
What will stop them 
from adopting and acting 
on your message? 


VULNERABILITIES 

In which areas are 
they vulnerable? Any 
recent changes, errors, 
or weaknesses? 


POLITICS 

Where is the balance 
of power? Who or what 
has influence over them? 
Would your idea create 
a shift in power? 
Make the Reward Worth It 


Whether it’s based on altruism or ego, people like to 
make a difference with their lives. That difference could 
be something as modest as “make this a great place to 
work” or as lofty as “save lives in Ethiopia.” 

No matter how stimulating you make your plea, an 
audience will not act unless you describe a reward that 
makes it worthwhile. The ultimate gain must be clear, 
whether it relates to their extended sphere of influence 
or possibly even all of mankind. If they are sacrificing 
their time, money, or opinion for your call to action, 
make it obvious what the payoff will be. 

Rewards should appeal to physical, relational, or self- 
fulfillment needs: 

• Basic needs: The human body has basic needs like food, 
water, shelter, and rest. When any of those are threatened, 
people will risk life and limb to secure them—even 
for someone else. People don't like to see others’ basic 
needs go unmet, and this prompts generosity. 

• Security: People want to feel secure and safe at home, 
at work, and at play. Physical, financial, or even technological 
security assures them that they are safe. 

• Savings: Time and money are two precious commodities. 
Your presentation’s reward might be saving 
the audience time or creating a generous return on 
their investment. 
• Prize: This can be anything from a personal financial 
reward to gaining market share. It is the privilege of 
taking possession of something. 

• Recognition: People relish being honored for their 
individual or collective efforts. Being seen in a new light, 
receiving a promotion, or gaining admission into something 
exclusive are all forms of recognition. 

• Relationship: People will endure a lot for the promise 
of community with a group of folks who make a difference. 
A reward can be as simple as a victory celebration 
with those they love. 

• Destiny: Guiding the audience toward a lifelong dream 
fulfills the need to be valued. Offer the audience a 
chance to live up to their full potential. 

In light of these categories, ask yourself the following: 
What is it that the audience gets in exchange for 
changing? What is in it for them? What do they gain 
by adopting your perspective or buying your product? 
What value does it bring to them? 

As you’ve learned from The Hero’s Journey, the hero 
leaves the ordinary world, enters a special world, and 
returns not only changed as a human being but bearing 
an Elixir—a reward for having taken the journey. The 
reward for your audience should be proportional to the 
sacrifice they have made. 


Identify the Reward 
(new bliss) 


BENEFIT TO THEM 

How will they personally 
benefit from adopting 
your idea? What’s in it 
for them materially 
or emotionally? 


BENEFIT TO SPHERE 

How will this help their 
sphere of influence such 
as friends, peers, students, 
and direct reports? How can 
they use it to their benefit 
with those they influence? 


BENEFIT TO MANKIND 

How will this help the 
humans or the planet? 
Case Study: General Electric 

Showing the Benefit of Change 



As one of the largest organizations in the world, General Electric 
places tremendous value on innovation. They solve today’s problems 
while imagining new innovations that shape the future. Admittedly, in 
this process, yesterday’s innovations are obsolesced by tomorrow’s 
needs. The organization is constantly in a state of flux between what 
is and what could be. 

Communicating within this atmosphere of innovative tension isn’t 
always easy. Chief marketing officer Beth Comstock has led a team 
that has navigated this territory effectively. Many of Comstock’s 
presentations address the contrast of what is versus what could be. 


Comstock coupled her contrasting words with 
contrasting images to amplify her message. 






Comstock delivered the presentation featured on the next 
few pages to persuade her sales and marketing team that 
“growth in a downturn” is possible (notice the contrast 
even in her title). She wanted to move her team from the 
defeatist mindset of a downturn (what is) to believing they 
could innovate in a downturn (what could be). It’s common 
for her presentations to address the theme of navigating 
through the tension of innovation. 

Comstock sprinkles her communication with personal 
stories of risk, frailty, and victories, which makes her credible 
and transparent. She once even shared how previous 
GE CEO Jack Welch called her only to hang up the phone 
midsentence. When Comstock called his assistant, she 
was told, “He’s teaching you a lesson—that’s how you come 
across sometimes.” It was a stark lesson about leading and 
coaching with humor. 

Comstock is a natural at communicating contrast. The setup 
of her presentation is below. The content has been edited on 
the following pages into a “move from,” “move to,” “benefit,” 
and “personalized story” matrix so you can see the brilliant, 
underlying structure she inherently used. 


Growth in a Downturn? 

Jeff Immelt took over as CEO of GE in 2001 with a strategy 
to grow the company from within while investing more in 
technology/innovation, global expansion, and customer 
relationships. To make this happen, GE needed a stronger 
marketing organization to sit beside technology, sales, 
and the regional business leaders. For decades, GE was so 
confident in its products that it believed the products could 
practically market themselves. Then, a collective awakening 
occurred: Seasoned marketers could push GE to go more 
places, organize technologies to accomplish new feats, and 
help point the company in the direction of even more sales. 

GE set an aggressive course in 2003 to double its marketing 
talent and build new capabilities. Comstock was brought 
on as the first CMO in decades. GE marketers established 
a marketing-led innovation portfolio and process across 
GE that creates between $2 and $3 billion a year in new 
revenue. Through this effort, GE defined marketing innovation 
as a necessary partner for technical and product 
innovation. Marketers were a critical part of the team that 
drove 8 to 10 percent organic growth—more than double 
the historic rate. 

But by 2008, a global economic crisis was wreaking havoc 
on growth rates and changing customer behavior. What 
happens when growth stalls? Was it time for GE to cut 
marketing? The decision was just the opposite. Marketing 
needed to be valued as a function for all seasons. 

Comstock was inspired by research conducted by Harvard 
Business School’s Ranjay Gulati. Gulati observed that companies 
that relentlessly focus on the customer and invest more 
in the pipeline in a downturn can expect to stay ahead for up 
to five years after recovery. Now that gets your attention! 

GE’s goal in 2008 was to stay focused on growth, no matter 
how tough the environment. GE needed to plant seeds so 
it would be poised when recovery happened. That meant 
investing in new opportunities and encouraging new ideas. 
Foster Creativity 

Move to Being 

Move from being uncomfortable with 
creativity to believing that everyone can 
be creative—it's frightening to move 
outside a comfort zone. 

Move to Doing 

Move from chaotic to organized through 
"freedom within a framework.” Define the 
problem, make room for ideas, and work 
as individuals and teams. 

Benefit/Outcome 

Creativity takes planning in multiple iterations, 
but good process helps ideas stick 
and energizes the team. 

Personalize 

A team of nuclear scientists went behind 
the scenes at NASCAR to learn similarities 
in the way race cars and nuclear 
plants are serviced. For me, keeping an 
idea journal is a helpful way to create a 
"space” to ideate. 


Navigate Ambiguity 

Move to Being 

Move from being paralyzed by not 
knowing all the answers to accepting 
that you will never know all 
the answers. 

Move to Doing 

Move from fear of starting to picking 
a path, knowing that where you 
end up could be very different than 
where you started. 

Benefit/Outcome 

Removing ambiguity helps you face 
reality, make the tough calls, and be 
flexible with new approaches. 

Personalize 

Jack Welch taught me the importance 
of wallowing. After I had spent 
many years in fast-paced news 
environments, Jack taught me how 
to get to know ideas and people. 



Take Risks 

Move to Being 

Move from being afraid of instigating 
ideas to fighting for a better way. 
Instigators are rarely welcomed but 
are critical to the creative process. 

Move to Doing 

Move from low visibility to moving 
forward without the answers. Ideas 
need a champion to turn them into 
action, so executive buy-in is critical. 

Benefit/Outcome 

If you do not jump in, you will regret 
the missed opportunity. When you 
fail fast, you fail small. 

Personalize 

I needed to overcome my reserve. 
Sometimes I look back to when I was 
reluctant even when I knew I could 
add value, then regret the missed 
opportunity. Now I tell myself, “You 
don’t want to miss this. Get in there.” 
Develop New World Skills 

Move to Being 

Move from being a technophobe to 
seeing that in a networked world, value 
comes from who you are connected to. 

Move to Doing 

Move from the illusion of control to inviting 
others to join with you. Your best 
selling machine can be validation from 
customers in your network. 

Benefit/Outcome 

Transform your sphere of influence 
and turn your network into an asset 
that predicts future actions, needs, 
and solutions. 

Personalize 

The Obama campaign understood the 
power of a decentralized network of 
people who shared a passion for change 
in the political systems. They were given 
access to key tools, information, and the 
freedom to use them. 


Empower Teams 

Move to Being 

Move from going it alone to forming 
partnerships, because teams 
with multiple points of view create 
diverse solutions. 

Move to Doing 

Move from fear of criticism to recognizing 
tension as an important part 
of the creative process. Give critics a 
voice, and they’ll become advocates. 

Benefit/Outcome 

Partnerships allow you to share 
risk, fill in capability gaps, and 
focus expertise. 

Personalize 

I believed I had to do it all myself 
and didn’t ask for help. I learned 
that you have to invite others in and 
that it’s okay to admit you need 
help. People want to help and be 
part of something bigger 
than themselves. 



Unleash Your Passion 

Move to Being 

Move from lacking passion to 
encouraging passion—yours and 
theirs. Lack of passion stalls ideas, 
so start and end with passion. 

Move to Doing 

Move from personal passion to 
shared passion blended with compassion—it 
creates an energy that 
propels projects and meets needs. 

Benefit/Outcome 

You create an energy that builds on 
itself, that creates momentum and 
engagement from others. 

Personalize 

I’ve learned that sometimes my passion 
can overwhelm others, especially 
if it borders on aggressiveness. 
I’ve had to let ideas germinate and 
encourage others to add to them 
and make them their own. 
Most audience members are comfortable with the view from their own 
perspective and don't like to admit there might be another valid perspective 
out there. When you propose your idea, it forces them to make a decision 
to either adopt your idea or live with the consequence of refusing to adopt 
your idea. 

To ensure your idea is adopted, it’s important to have a plan—a definitive 
destination. Determining the destination involves creating a big idea (with 
the stakes articulated). You also need to plan out the audience journey of 
where you want them to move from and where you want them to move to. 

They will possibly (okay, most definitely) initially react to your proposed 
change with resistance. Address the resistance and risks involved so their 
fears are pacified and they are willing to jump in. 

Make sure the benefit is clear to them. You're persuading them to change, 
and there has to be something in it for them, their organization, or mankind 
to make it worthwhile. 
Everything and the Kitchen Sink 


It’s now time to collect and create information. Resist 
the temptation during this initial phase to sit down with 
presentation software; it’s not quite time for that yet. 

This chapter covers various idea-generation techniques. 
It’s rare that the first, most obvious idea generated is 
the best one. Tenaciously generate ideas along a theme 
until you’ve exhausted all possibilities. Usually, the truly 
clever ideas appear in the third or fourth round of 
idea generation. 

You will use divergent thinking—the mental process 
that allows idea creation to move in any direction you 
can imagine. Divergent thinking enables new, original 
content to emerge. This is a messy phase, so suspend 
neatness and allow yourself to stay unstructured— 
you’ll be scouting for new ideas and mining existing 
ones. Broadening the number of possibilities creates 
unexpected outcomes, so explore every solution and 
suspend judgment. 

Generate as Many Ideas as Possible: 

• Idea collection: While you can avoid starting from 
scratch by collecting presentations from peers, 
that’s not the only type of information out there; 
and regurgitating someone else’s slides is not the 
best way to connect with your audience. Collect 
readily available ideas—but more importantly, 
purposefully mine for inspiration from all other 
relevant resources. 

When panning for gold, prospectors scoop up a 
pan full of dirt and swish it around until the heavier 


and more valuable gold settles to the bottom— 
never knowing which pan full of dirt will yield a 
great nugget. So scoop “dirt” from everywhere 
during the idea-collection phase. Look at industry 
studies, competitor insights, news articles, partner 
programs, surveys—everything. Go both wide and 
deep. Gather as much as possible about the competitor’s 
messages so you can position yourself differently 
than they do. Find out everything about the 
subject, and roam into tangential topics for insights. 

• Idea creation: Inventing new ideas is a different 
process from mining existing ones. This is where you 
need to think instinctively—from your gut. Be curious, 
take risks, be persistent, and let your intuition guide 
you. Draw from your creative side to generate ideas 
that have never existed or been associated with your 
big idea before. Recognize that when probing into 
what’s possible, your ideas will exist in a bit of a fog— 
because you can only see the future dimly. Approach 
this in an open-minded state—one in which you’ll 
explore the unknown. You’re experimenting, risking, 
dreaming, and creating new possibilities. 

Grab a sheet of paper or a stack of sticky notes and jot 
down everything you can imagine that supports your 
idea. The goal is to create a vast number of ideas, and 
you’ll be prompted to add even more over the next 
several pages! But don’t worry; you’ll filter, synthesize, 
and categorize all of them and craft a meaningful 
whole later on. 
More Than Just Facts 


Now that you have begun to collect and create content, this first batch 
you brainstormed might be composed primarily of facts. Facts are one 
type of content to collect—but they’re not the only type needed to create 
a successful presentation. You must strike a balance between analytical 
and emotional content. Yes, emotional. This might not be a step with which 
you’re comfortable, but it’s an important one nonetheless. 

Aristotle claimed that to persuade, one must employ three types of 
argument: ethical appeal (ethos), emotional appeal (pathos), and logical 
appeal (logos).1 Facts alone are not sufficient to persuade. They need to 
be complemented with just the right balance of credibility and content 
that tugs at the heartstrings. 


ETHICAL APPEAL 

Garner respect through 
credibility and character 


EMOTIONAL APPEAL 

Stir emotions and imagination 
of the audience 



LOGICAL APPEAL 

Provide evidence through 
words, structure, and data 
Stating fact after fact in an hour-long presentation doesn’t signal to the audience 
why these facts are important. Use emotions as a tool to bring emphasis 
to the facts so they stand out. If you don’t, you’re making the audience work 
too hard to identify the decision they are to make. Staying flat and factual 
might work in a scientific report but simply won’t work for the oral delivery 
of persuasive content. 


ETHICAL APPEAL 

Connect with the audience through 
shared values and experiences. 

Create the right balance of analytical 
and emotional appeal; this will bolster 
your credibility. The audience will feel 
connected to and have respect for 
your idea. 


LOGICAL APPEAL 

Develop a structure to keep the 
presentation intact and help it 
make sense. Make a claim and 
supply evidence that supports the 
claim. It is necessary to use logical 
appeal in all presentations. 


EMOTIONAL APPEAL 

Stimulate your audience through 
appeals to their feelings of pain or 
pleasure. When people feel these 
emotions, they will throw reason out 
the window; people make important 
decisions based on emotion. 


“The heart has its reasons which reason knows not of.” 

Blaise Pascal 2 
Randy Olson’s Four Organs of Communication 3 


THE HEAD 

The head is the home for brainiacs. It’s characterized 
by large amounts of logic and analysis. When you’re 
trying to reason your way out of something, that’s all 
happening in your head. Things in the head tend to 
be more rational, more “thought out,” and thus less 
contradictory. “Think before you act” are the words 
analytic types live by. 


Spontaneity and intuition reside down in these lower 
organs. They are at the opposite end of the spectrum 
from cerebral actions. And while they bring with 
them a high degree of risk (from not being well 
thought through), they also offer the potential for 
something magical. 


THE HEART 

The heart is the home for the passionate ones. People 
driven by their hearts are emotional, deeply connected 
with their feelings, prone to sentimentality, susceptible 
to melodrama, and crippled by love. Sincerity comes 
from the region of the heart. 

THE GUT 

The gut is home to both humor and instinct (having 
a gut feeling about something). You’re a long way 
away from the head now, and, as a result, things are 
characterized by much less rationality. People driven 
by their gut are more impulsive, spontaneous, and 
prone to contradiction. Gut-level types say, “Just do 
it!” Things that reside in the gut haven’t yet been 
processed analytically. 

THE GROIN 

At the bottom of our anatomical progression is the 
groin. Countless men and women have risked and 
destroyed everything in their lives out of passion. 
There is no logic to these organs. You are a million 
miles away from logic in this region, and yet the 
power is enormous and the dynamic universal. 




Don’t Be So Cerebral 


People are more conditioned to generate content from 
their heads, because institutions encourage and reward 
employees who spend most of their time in their analytical 
region (head), so most people avoid the emotional 
region (heart, gut, and groin). Yet it’s from this 
more emotional region that hunches, hypotheses, and 
passions are generated—big ideas need those too. 

Whatever your natural communication tendency is, you 
need to learn skills in the other regions to appeal to a 
broad audience. If you speak solely from the analytical 
region, move a bit lower; many decisions are made from 
emotion. In fact, your next investor might make financial 
decisions by following his heart. But if you communicate 
only from the emotional region, an analytically driven 
audience won't buy into your lack of proof, which could 
ruin your credibility. 


What is it like to create presentations from your whole 
self—both analytical and emotional? 

Ideas generated from lower regions are more innovative; 
they’re bolder and riskier, but also more interesting. 
Abandon the spreadsheets and matrices and imagine 
what could be. Let your lower regions guide idea generation, 
and venture into more exciting adventures. Imagine 
the unknown without feeling silly about it. After you’ve 
exhausted these yet-unfamiliar places, turn to your head 
to analyze them. Make an intentional attempt to move 
back and forth from the head to the gut to ensure that 
you’re using integrative thinking. 4 

“Emotions and beliefs are masters, reason their servant. 
Ignore emotion, and reason slumbers; trigger emotion, 
and reason comes rushing to help.” 


Henry M. Boettinger 5 





Contrast Creates Contour 


People are naturally attracted to opposites, so presentations 
should draw from this attraction to create interest. 

Communicating an idea juxtaposed with its polar opposite 
creates energy. Moving back and forth between the 
contradictory poles encourages full engagement from 
the audience. 

Taking a strong and clear position opens up the opportunity 
for others to come up with a compelling counterposition, 
creating contrast. For each claim you make, 
the odds are high that there is a polar opposite claim 
that someone in the room supports. Of course, you 
believe that your perspective is the correct one—yet 
others in the room will likely differ. 

The gap between what is and what could be is established 
through creating contrast. Most people jump 
right to describing what the world looks like today (or 
historically) versus what it could be tomorrow. That’s 
the most obvious type of contrast. But it could also be 
“what the customer is like without your product” versus 
“what the customer could be with your product.” 
Or “what the world looks like from an alternate point 
of view” versus “what the world looks like from your 
point of view.” Basically, the gap is any type of contrast 
between where the audience currently is and where 
they could be once they know your perspective. 
Addressing alternate points of view and contrasting 
perspectives is not only thorough; it’s interesting—and 
there’s proof. 

In a 1986 article in the American Journal of Sociology, 
John Heritage and David Greatbatch analyzed 476 political 
speeches in Britain and studied what preceded the 
applause. They wanted to figure out, for example, why a 
speech could be received in total silence, whereas other 
speeches were applauded nearly twice per minute. 
What was it that appealed to the audience enough to 
evoke the physical response of clapping? After studying 
over nineteen thousand sentences, they found that half of 
the bursts of applause could be attributed to a moment in 
the speech where a form of contrast was communicated. The 
role that contrast plays in generating a response from the 
audience was quite evident. 6 

The exercise on the next page will help you broaden your 
own perspective and create room for you to consider 
and address the audience’s alternate beliefs. Confronting 
their perspective gives you credibility; you’ll even hear 
opponents say things like, “Wow, that was thoroughly 
thought-out.” 


Create Contrast 

Review the ideas you’ve brainstormed so far. Each one 
of those ideas should have a contrasting idea inherent to 
it. There is an intelligent counterargument to each point 
you make. It’s important to explore them all. You might 
not use them, but as part of your preparation, you should 
know what they are. 

To the right is a list of contrasting elements to serve as a 
springboard. Most of your ideas possibly fall in one column 
or another. Look at all the elements in the list and generate
new ideas you might not have considered. Create
opposing ideas for each point that you can think of. Do 
this exercise for the items in each column and then repeat 
the process in the reverse order, which could trigger more 
ideas. When done, you should end up with a nice, hefty list 
of contrasting perspectives. 


WHAT IS                    WHAT COULD BE

Alternate point of view    Your point of view
Past/Present               Future
Pain                       Gain
Problem                    Solution
Roadblocks                 Clear Passage
Resistance                 Action
Impossible                 Possible
Need                       Fulfillment
Disadvantage               Advantage (Opportunity)
Information                Insight
Ordinary                   Special
Question                   Answer


Contrasting the commonplace with the lofty moves
audiences toward what could be. These thematic ideas are
what create the shapeliness of the up-and-down pattern in
the presentation form.


Transform Ideas Into Meaning


So far, you've generated and collected ideas. Now
you’ll give those ideas meaning. The structure and significance
of stories transform information from static
and flat to dynamic and alive. Stories reshape information
into meaning.

The brain processes information and associates meaning
with it. This mental process of attaching meaning
helps us categorize information, make decisions, and 
determine something’s worth. People place value on 
relationships and material goods depending on the 
meaning they bring. 

Trying to persuade by stating the features and specifications
of your subject matter, product, or philosophy
is meaningless until you add a human to the mix. Take
something like a medical device. The design may be
lovely and the alloy strong, but the attribute that creates
meaning is that it saves lives. Could there be a story
to tell about how the device is used to save a life, or 
even a doctor’s time? Features become valuable when 
they impact a human. That’s where the meaning lies. 

Stories help an audience visualize what you do or 
what you believe; they make others’ hearts more 
pliable. Sharing experiences in the form of a story 
creates a shared experience and visceral connection. 

The rest of this chapter focuses on how to make information
meaningful and, as a result, make the audience
more receptive to the ideas you are communicating. 

“Stories are the currency of human relationships.” 

Robert McKee 8 


You undoubtedly have items in your garage that you’re 
hanging on to that are precious to you but would be 
meaningless to others. I have those too. 

When my Gram passed away, she had nothing in her 
home of seeming material value. She was a smart, 
quick-witted lady who’d won awards for her poetry 
and lived a simple life in a tiny house on an orchard. 
When the dreaded task of dividing up her belongings 
came, I knew what I wanted: one of her small stained 
teacups. This seemingly valueless trinket would be 
worthless at a yard sale, yet it was precious to me. Not 
because of the craftsmanship or design but because of 
how and when it was used. I could visit Gram for hours, 
sipping from that cup as she told stories. The resale 
value of the teacup is less than a nickel, yet at the same 
time, to me the value of the cup is priceless. 

The value of one’s belongings or even their life is not 
based on what it physically is; the real value comes from 
the meaningfulness associated with it by another person. 


Recall Stories


Most great presentations use personal stories. As you 
create the content, there will be places where you want 
the audience to feel a specific emotion. Recalling a time 
when you had that very same emotion connects the 
audience to you in a credible and sincere way. Creating a 
personal catalog of stories associated with various emotions
is a useful resource.

One instinctual way to recall stories is to reflect upon a 
timeline of your life. You can go year by year or cluster 
the years into phases like early childhood, elementary 
age, middle school, high school, college, career, parenting,
grandparenting, and retirement.

However, drumming up memories based on chronology
is only one way to do it. Breaking the chronological pattern
can help recall a deeper (and possibly dormant)
set of stories. Think about people, places, and things
instead. As you explore these areas, draw sketches 
of what you see and jot down as many memories and 
emotions triggered as possible. 

• People: You can evoke relational memories by 
capturing a list of the people you’ve known. Start 
by creating a hierarchical family tree that displays 
familial bonds. Then, begin connecting and linking 
relatives to each other outside of the hierarchical 
lines based on exchanges or situations in which 
they interacted in some way. List other people 
you’ve known who’ve influenced you and relationships
you’ve observed: teacher/student, boss/co-worker,
friend/enemy. These kinds of power
dynamics make exciting stories. Think through the 
relational dynamics and feelings you have toward 
each person. 

• Places: Carefully think about spaces where you’ve
spent time: homes, yards, offices, neighborhoods, 
churches, sporting facilities, vacation sites—any 
place, even virtual spaces. Use your memories of 
these to transition into spatial recollection. Mentally 
move from room to room, drawing as many details 
as you can remember. You’ll “see” things you’d forgotten.
Visually moving from one space to another
will trigger scenes and even long-disregarded scents 
and sounds. Changing gears to sketching allows you 
to use a different part of your body and brain, which 
can loosen more memories. 

• Things: Try to catalog the material things you’ve 
possessed in your life that you deem valuable. They 
don’t have to have been expensive items, just sentimentally
significant. Why were they so precious to
you? Did you love your old jalopy because you had 
your first kiss there? Or your old teddy bear because 
it comforted you when you had your tonsils out? 
What are the stories behind these items that make 
them important to you? Sketch a picture of them 
with as much detail as possible in the environment 
where they were usually found. This will trigger even 
more emotions and memories. 

Sketching these memories is a great way to classify and 
recall stories. If you’re uncomfortable sketching, find 
images to represent the stories. Create a visual trigger 
and jot down as much of the memory as you can, especially
how you felt as the story unfolded. You can reference
this collection of stories whenever you need to tell
a personal anecdote with conviction. 


When I get creatively stuck, I bounce back and forth 
between writing and visualizing. This process sparks 
new ideas, metaphors, or visual explanations. 

I once needed a story for a presentation that communicated
staying calm under pressure. I wanted to draw from
a real childhood memory. Instead of recreating my youth 
chronologically through a timeline, I drew the floor plan of 
my childhood home to trigger visual memories. My brain 
traveled through each room, recalling dormant memories 
of my lost turtle, stage productions in the basement, and 
other vivid images. 

But most importantly, I found my story. While drawing 
the floor plan of the upstairs, a memory of my four-year-old
little sister, Norma, came flooding in as I sketched a
closet door. She’d accidentally locked herself in the closet. 
The lock was made in the early 1900s and was on the 
inside of the closet. It had a difficult two-step process that 
involved turning a dial and moving a lever sequentially to 
open it. I felt helpless and clawed at the door from the 
outside while she screamed on the inside. My grandfather 
ran off mumbling something about finding the ax. Images 
of a bloody mess shot through my mind; I had to do 
something. I quieted Norma down enough to explain the 
choice of having Grandpa hack the door down or calming 
down and listening to my instructions. On her tiptoes, she 
carefully turned the knob, pressed the switch, and was 
freed just as Grandpa ran back into the room. I knew she 
could do it, but only with calm, persistent determination. 
The story worked perfectly! 


Turn Information Into Stories


Stories strengthen presentations by adding meaning. 
Used well, stories, analogies, and metaphors help create 
significance and stimulate the senses. Stories can be 
one sentence long or weave through an entire presentation
as a theme (page 156).

Stories are easy to repeat. Transforming information 
into an anecdotal format charges the information emotionally
and puts it into a readily digestible format.


Below is a template that uses a shortened version of 
The Hero's Journey. 9 You can add as many details and
descriptive flourishes as you're comfortable with,
but the basic structure remains sound. Think about
what types of information help illustrate your point 
best and turn some of that information into a story 
format. To the right are examples of how the template 
below transformed information into story. 


Short Story Template 10 


BEGINNING

When               Transition      Who/What          Where

Once upon a time   there was       a manager         in marketing
In 1993            I heard about   a person (name)   in Singapore
Two months ago     I bought        a computer        on eBay
Years ago          I saw           a car             in a garage
In ten years       there will be   an event          somewhere


MIDDLE

Context              Conflict                         Proposed Resolution   Complication

At the time          Which put us in conflict with    So                    (Optional but effective)
This was happening   We knew that couldn't continue   We tried this         • What risks were there?
                     The results weren’t acceptable                         • Were you worried?
                                                                            • What if it failed?


END

Actual Resolution                              MIP (Most Important Point)

In the end ... (doesn't have to be positive)   What's the moral or core message?


STORY ABOUT ORGANIZATIONAL CHANGE 11

POINT YOU WANT TO MAKE
Every cross-divisional function could benefit from a steering committee.

BEGINNING
When, Who, Where: A few years ago, the sales team tackled a problem
that demonstrates the cross-divisional issues I’m talking about.

MIDDLE
Context: At the time, all sales groups were independent.
Conflict: This means we were confusing the customers with many
different rules, processes, and formats.
Proposed Resolution: So we decided to create a sales-steering committee.
Complication: You can imagine how hard it was to reach agreement
on anything.

END
Actual Resolution: But we agreed to meet every two weeks to discuss
common ground. Over the next year, we standardized all our processes
and learned a lot from each other. The customers were much happier
with our service.

MOST IMPORTANT POINT
I think every cross-divisional function could benefit from a
steering committee.


STORY ABOUT CUSTOMER INTEREST

POINT YOU WANT TO MAKE
Midsized companies would save money if they bought this software.

BEGINNING
When, Who, Where: Last year I met with Susan, the CEO from a company
very similar to yours.

MIDDLE
Context: She was strategically wicked-smart, and, just like you, she
was curious whether our software could help her business.
Conflict: She knew that her organization wouldn’t scale if she didn’t
have software that worked in a global environment.
Proposed Resolution: We installed a trial version for the employees
in the Dallas office only.
Complication: She was concerned that the employees would have a dip
in productivity while learning a new program.

END
Actual Resolution: Instead, employee productivity increased, and Susan
received numerous e-mails about how the software will help them gain
market advantage. It took her less than a week to agree to an
organization-wide installation.

MOST IMPORTANT POINT
Your company has the same challenges and would benefit, too.


Case Study: Cisco Systems

Hop to It 


Technology is meaningless until you understand how 
humans use it and benefit from it. This is often the 
conundrum in presenting technology. The emphasis 
is placed on the object and its features rather than on 
how it will help the user. 

Consider the original slide to the right and the original 
script that accompanied the slide. Though it initially 
seems to describe the human component, it's really 
nothing more than a laundry list of capabilities. 

This description is accurate, succinct, and completely 
devoid of charm or character. It answers the questions 
“what” and “how” while completely ignoring the “why.”
In other words, technology is capable of many things— 
but audiences need to be given a reason to care. 

That reason to care starts with the story. Paint a 
picture; provide a human element to which the audience
can relate; tell them “why.” Once you have them
hooked, you can pull back the curtain and show them 
how the technology really works. You will lose an 
audience if you jump into how a magic trick works 
without first performing the jaw-dropping trick itself. 

The story on the following pages transforms the original
presentation by capturing how Cisco’s technology
helped a small-business man become more agile and 
smart in managing his business. 

When your company's tagline is “the human network,”
telling how humans benefit from this network
is important. Weaving it into a story with a real character
is even better.


ORIGINAL SLIDE

[Slide labels: Joint Solution; Composite Applications; Cisco OI Solution;
Enterprise / SP; Process Integration; Information Integration; Enterprise;
Distributed Real-time Operational Intelligence (Cisco); Events Rollup (OI)]

ORIGINAL SCRIPT


“Here's an example of the power of Unified
Communications in manufacturing. 

The team can enter the meeting via a Cisco IP 
touch-screen phone or via the telephony user 
interface on their cell phone. 

The meeting can easily move from a simple audio 
conference to a web conference if documents 
need to be shared, and also to a video conference 
if video content (such as a real-time view of the 
machinery on the line) needs to be reviewed to 
solve the problem.” 




















STORY STRUCTURE 


Introduce your hero early—and give 
your audience a reason to root for 
him or her. 


“HOP TO IT” STORY 



Dave is president of a large microbrewery. 
He’s won more regional beer competitions 
than anyone else, and he’s hungry for his 
next one, confident that his award-winning 
recipe will land him another victory. 


Set up the conflict clearly, but don’t 
reveal how the hero will overcome 
it—that’s part of the mystery. 



Unfortunately, while gearing up to brew 
a batch of his new beer for the competition,
he discovers his secret ingredient,
his prize hops, hasn’t arrived. 


Provide the audience more information
about the nature of the
challenge; often this comes from 
unexpected sources or new 
characters. 



Just then, Dave’s supply-chain manager 
receives a notification—the shipment 
of hops has been delayed in customs. 
The network detected the message and 
routed it to Dave’s brewing company, 
where a text message alerts Dave’s 
supply-chain manager. 


STORY STRUCTURE


Develop a complication. There’s 
nothing like raising the stakes. 


Reveal a solution, but make sure it’s 
not an easy one. This secondary challenge
raises the stakes for the hero
and keeps the audience on edge. 

Bring the story to a head, laying out 
all the stakes for all the characters. 


When picking up the story, recall the 
original premise to refresh the audience’s
memory.



Now Dave has big problems. His hops 
haven’t arrived, and there’s no telling 
how long they’ll be held up in customs. 
He needs to launch his new brew at the 
competition because he’s depending on 
the press coverage to make it a top seller 
this year. And he’s shut down a big part 
of his operation in anticipation of the 
event, so he’ll lose revenue if he can’t 
make his deadline. 


But there may be a solution: On the other 
side of the country, another hop supplier 
growing the same variety has had a 
bumper year and needs to unload his 
product before it goes bad. What will 
happen to Dave? Will he be able to 
defend his title? Will the competition 
organizers be able to draw the crowds 
they need? Will the alternate hop 
supplier find his customer? Find out 
in the exciting conclusion... 


When last we left our heroes, all was not 
well. Fortunately, when the shipment was 
stopped at customs, Dave and his team 
were notified immediately. 


Cliffhanger: It’s at this point in the story you can break to explain how the technology
works. The audience will be left in a state of suspense, wondering what
becomes of the characters, while you provide background on the solution. This
serves two purposes: You privilege your audience with information the characters
don’t have, and you provide the hard data you must share.


For stories with multiple characters
or crises, a step-by-step approach 
makes the ultimate solution simple 
and believable. 



The production manager determines 
the exact shortage based on the new 
recipe, then checks potential sources at 
other key suppliers through his secure 
network connection. 


Build to the resolution. Show the 
incremental steps in overcoming 
the challenge. 

Create a climax wherein all the 
story threads are resolved except 
for one—the original challenge. 



He identifies the alternate hop supplier, 
indicates the needed quantity, verifies 
the variety, and places an order. 

The grower’s sales rep receives the 
order, finds the available production 
supervisor, and clicks to connect to 
him—via multiple devices—confirming 
that he can ship the hops right away. 
The domestic supplier confirms the ship 
date with Dave, who is able to confirm 
his participation in the competition... 


Let that resolution be the final 
movement, the scene that lifts the 
hero from one state to another. 



...which, of course, he wins again. 


Move from Data to Meaning


Numbers can be captivating if you move beyond just 
spouting the data. According to Now You See It author 
Stephen Few, “As providers of quantitative business 
information, it is our responsibility to do more than sift 
through the data and pass it on; we must help our readers
gain the insight contained therein. We must design
the message in a way that leads readers on a journey of 
discovery, making sure that what’s important is clearly 
seen and understood. Numbers have an important story 
to tell. They rely on you to give them a clear and convincing
voice.” 12

Numbers rarely speak for themselves. How big is a billion?
How does that figure compare to others? What
causes the numbers to go up or down? You can leave 
it up to individual interpretation, or you can explain the 
bumps, anomalies, and trends by accompanying them 
with narrative. 

There are a few ways to explain the narrative in 
the numbers: 

• Scale: Nowadays, we casually throw around profoundly
large (and minutely small) numbers. Explain
the grandness of scale by contrasting it with items 
of familiar size. 

WaterPartner.org’s 2008 animation: “This year,
1 white girl will be kidnapped in Aruba, 4 will die
in shark attacks, 79 will die of Avian flu, 965 will
die in airplane crashes, 14,600 will lose their lives
in armed conflict, 5,000,000 will die from water-related
disease. That’s a tsunami twice a month or
five Hurricane Katrinas each day, or a World Trade
Center disaster every four hours. Where are the
headlines? Where is our outrage? Where is our
humanity?”


• Compare: Some numbers sound deceptively small or 
large until they’re put into context by comparing them 
to numbers of similar value in a different context. 

Intel CEO Paul Otellini’s 2010 CES Presentation:
“Today we have the industry’s first-shipping 32-nanometer
process technology. A 32-nanometer
microprocessor is 5,000 times faster; its transistors 
are 100,000 times cheaper than the 4004 processor 
that we began with. With all respect to our friends 
in the auto industry, if their products had produced 
the same kind of innovation, cars today would go 
470,000 miles per hour. They’d get 100,000 miles 
per gallon and they’d cost three cents. We believe 
that these advances in technology are bringing us 
into a new era of computing.” 

• Context: Numbers in charts go up and down or get 
bigger and smaller. Explaining the environmental and 
strategic factors that influence the changes gives the 
numbers meaning. 

Duarte Founder Mark Duarte’s Vision Presentation: 
When rolling out the 2010 vision, Mark showed a 
graphic depicting four bold strategic moves the organization
had taken every five years since its founding
twenty years ago. He explained how each strategic
span of five years formed the corporate values. Then,
he overlaid historic revenue trends over the same five-year
increments showing how Duarte weathered each
economic storm, emphasizing the role each strategic
surge played in growth and opportunity. There was
little resistance in understanding why the next five-year 
plan was worth supporting. 

Telling the narrative implied in the numbers helps others
see the meaning of the numbers.
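
The rescaling behind a Scale comparison is simple arithmetic: divide an annual total into a time span the audience can picture. What follows is only a minimal sketch in Python, assuming a 365-day year. The function name `per_interval` is hypothetical, and the only figure taken from the text is the 5,000,000 annual total quoted in the Scale example.

```python
def per_interval(annual_total, hours):
    """Average count per interval of `hours`, assuming a 365-day year."""
    intervals_per_year = 365 * 24 / hours
    return annual_total / intervals_per_year


# Annual total quoted in the Scale example; everything else is derived from it.
annual_deaths = 5_000_000

per_day = per_interval(annual_deaths, 24)
per_four_hours = per_interval(annual_deaths, 4)

print(f"About {per_day:,.0f} per day")
print(f"About {per_four_hours:,.0f} every four hours")
```

The calculation only produces the rate. Choosing a familiar event of comparable size, and verifying its numbers, is still the presenter's job.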


Murder Your Darlings


Now that you've amassed all the analytical and emotional
content possible, it's time to narrow it down.
Many of the ideas are unique and were possibly fascinating
to uncover. But you can’t say it all, and no one
wants to hear it all. The ideas need to be filtered down 
to the points that succinctly support your big idea. 

The pages in this chapter have walked you through 
divergent thinking by generating ideas. You collected 
factual and emotional content and considered contrasting
perspectives.

Now it's time for some convergent thinking. Divergent
and convergent thinking were identified by J. P. Guilford in 1967
as two different types of thinking that occur in response
to a problem. Divergent thinking generates ideas, while 
convergent thinking sorts and analyzes these ideas 
toward the best outcome. 

So hopefully, all the ideas you just generated give you 
some great creative choices to sift through. 

In his book Change by Design, Tim Brown says, “Convergent 
thinking is a practical way of deciding among existing 
alternatives. Think of a funnel, where the flared opening 
represents a broad set of initial possibilities and the small 
spout represents the narrowly convergent solution.” 13 



“In the divergent phase, new options 
emerge. In the convergent phase, it 
is just the reverse: Now it’s time to 
eliminate options and make choices. 
It can be painful to let a once- 
promising idea fall away.” 

Tim Brown 14 

Although you may feel that all the ideas you generated
are insightfully riveting and took a ton of time to generate,
they need to be sorted and organized, and some
ideas need to be killed off. Killed? Yes; and the best filtering
device you have is your big idea itself. Review it again,
and eliminate all the fodder you captured that doesn’t 
distinctly support that one big idea. 

It’s a violent creative process to construct ideas, destroy 
them, group them, regroup them, select them, reject 
them, rethink them, and modify them. Use both divergent 
and convergent thinking processes repeatedly until you 
have the most salient content to support your big idea. 

When you feel that you have firmly established your position
and filtered your ideas, review page 105 and validate
that you retained enough interesting contrast. You don’t 
want contrast to hit the cutting-room floor during the 
vetting process. 

Filtering is very important. If you don’t filter your
presentation, the audience will respond negatively
because you’re making them work too hard to discern
the most important pieces. While they are listening,
they are determining in their minds what was interesting
versus what was superfluous. And given the current
social media environment, they have a forum to let
others know, very publicly, their impression of your
presentation. Their feedback can be brutally honest too. So
if you don’t edit it, the audience will be frustrated, and 
they might have the creative chops to distribute their 
thoughts to thousands of their social network followers. 
Make edits on behalf of the audience; they don’t want 
everything. It's your job to be severe in your cuts. Let 
go of ideas even if you love them, for the sake of making 
the presentation better. 

Audiences are screaming “make it clear,” not “cram 
more in.” You won’t often hear an audience member say, 
“That presentation would have been so much better if 
it were longer.” Striking a balance between withholding 
and communicating information is what separates the 
great presenters from the rest. The quality depends just 
as much on what you choose to remove as what you 
choose to include. 

“Whenever you feel an impulse to perpetrate a piece of 
exceptionally fine writing, obey it—whole-heartedly—and 
delete it before sending your manuscript to press. MURDER 
YOUR DARLINGS.” 


Sir Arthur Quiller-Couch 15 


From Ideas to Messages


Now that you’ve edited down the content, you’re going 
to cluster it by topic and then turn the topics into discrete
messages. Grab a fresh piece of paper or a stack
of sticky notes and write out the three or so major topics 
that support the big idea and spread them out, giving 
them breathing room. The important points should be 
top-of-mind after all the research you’ve done, but if 
you’re struggling to limit them to five, it might take a bit 
of mental negotiation to murder another darling or two. 

Each topic should overlap as little as possible. Make 
sure that nothing relevant to your big idea has been 
overlooked. There’s a thinking process commonly used 
at McKinsey called MECE (Mutually Exclusive and 
Collectively Exhaustive): 

• Mutually Exclusive: Each idea should be mutually
exclusive and not overlap with the others; otherwise
you will confuse the audience. (“Hey, haven’t
we talked about the acquisition already?”)

• Collectively Exhaustive: Don’t leave anything out.
If you plan to talk about your competitors, you
should not mysteriously leave one out. The audience
expects you to be complete.
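
Once ideas are grouped under topics, the two MECE conditions can also be checked mechanically. The following is only a minimal sketch in Python. The function name `check_mece`, the topic names, and the idea labels are all hypothetical, chosen for illustration, and are not part of any published MECE procedure.

```python
from itertools import combinations


def check_mece(all_ideas, clusters):
    """Return (overlaps, missing) for a dict mapping topic -> set of ideas.

    overlaps: ideas filed under more than one topic (not mutually exclusive)
    missing:  ideas filed under no topic (not collectively exhaustive)
    """
    overlaps = set()
    # Compare every pair of topics and collect the ideas they share.
    for (_, a), (_, b) in combinations(clusters.items(), 2):
        overlaps |= a & b
    covered = set().union(*clusters.values()) if clusters else set()
    missing = set(all_ideas) - covered
    return overlaps, missing


# Hypothetical ideas for an acquisition announcement (illustrative only).
ideas = {"competitor threat", "lessons from last deal",
         "integration cost", "shared values"}
clusters = {
    "Market": {"competitor threat"},
    "Acquisition": {"lessons from last deal", "integration cost"},
    "Operations": {"integration cost"},
}

overlaps, missing = check_mece(ideas, clusters)
print("Filed under two topics:", sorted(overlaps))
print("Left out entirely:", sorted(missing))
```

An overlap suggests that two topics should be merged or that the idea belongs in only one of them. A missing idea is either a darling to murder deliberately or a gap in the topic list.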

Once you’ve nailed down the key topics, list three to 
five supporting ideas around each. To the right is an 
example from a presentation announcing an acquisition
that would be delivered at an employee meeting.

The topics you initially generate are usually a single
word or a sentence fragment. In the same way that a 
big idea shouldn’t be a topic, these little ideas need 
to be transformed into messages as well. Again, a 
message should be a full sentence that’s emotionally 
charged. Topics are neutral; messages are charged. 

Now that you’ve created clusters of ideas around the 
topics, you’re going to transform the topic into a key 
message for each cluster. 

Each message should feature as much contrast as 
necessary to effectively communicate the point. 

In the acquisition example on the left-hand page, the 
first acquisition failed. They shouldn’t jump right into 
discussing the new acquisition (what could be) without
acknowledging the first failed acquisition (what
is). The message of the new acquisition must include 
an acknowledgment of what was learned from the 
previous failings, or the audience will feel like this new 
acquisition will fail also. 

Changing topics into messages ensures that the content
supports one big idea and that each message has
an emotional charge to it. In the next chapter, you’ll be 
arranging and structuring these messages. 


Here are examples of changing the topics on the 
previous page to messages: 


TOPIC         MESSAGE

Market        We have an aggressive competitor grabbing market share.
Acquisition   This acquisition will be successful because we applied
              insights from the last one.
Operations    Operations will pay the biggest price, so let’s all
              support them well.
Culture       Our culture is valuable and will be strengthened through
              this historic change.


The big idea is the well from which all supporting ideas spring, and it
is also the filter to sort ideas down to the ones most applicable. Most 
presentations suffer from too many ideas, not too few. 

Even though you explored hundreds of potential ideas and left no rock 
unturned, don't convey every idea, only the most potent ones. 

Keep a stranglehold on the one big idea you need to convey and be 
relentless about building content that supports that one idea. 


Establish Structure


Now that you’ve created meaningful messages, how 
do you arrange them for maximum impact? You structure
them in a deliberate and logical way. A solid structure
is the foundation of a coherent presentation, and
shows the relationship between the parts and whole. 
It’s similar to the couplings on a train or the string of 
a pearl necklace; it keeps everything connected in an 
orderly fashion, as if the content were destined to fit 
together neatly within a given framework. Without 
structure, ideas are easily forgotten. 

“It’s unwise to merely dump a pile of unstructured 
information into the laps of your audience. They will 
have the same reaction as if you take a watch apart, 
fling the pieces at them and say ‘Here’s all you need 
to make a watch.’ You might get high marks for 
research and energy, but that is a low-class consolation 
prize. By doing this you confess that you don’t know 
what to do with all the stuff you’ve dug up. Audiences 
expect structure.” 


Henry M. Boettinger 1 

Most presentation applications are linear and encourage
users to create slides in a sequential order. One
slide follows the other, which naturally compels the 
user to focus on the individual details instead of the 
overarching structure. To help your audience “see” 
the structure, move out of the linear format of the 
presentation application and create an environment 
where you can look at the content spatially. 

There are several ways to do this. You can use sticky
notes, tape slides on a wall, or lay them on the floor. Any
method that pulls your content out of a linear presentation
application will work. Moving out of a slide-creation
environment helps identify holes and keeps you focused
on the bigger picture. This will help move your presentation
from being about a bunch of small parts to being
about a single big idea. 

Clustering your content helps you visually assess how 
much weight you’ve given to various portions and how 
many supporting points you need to get your message 
across. Use this technique to confirm that you’re emphasizing
the correct content and allocating appropriate
time for each message. 

Keep in mind that the structure should accommodate 
the audience’s comprehension needs and should be 
assembled in a way that’s palatable to them. It’s natural 
for subject matter experts to prepare material linking 
ideas that are closely connected in their own minds, 
but remember that the audience might not see these 
relationships as readily. Connect your messages in a 
way that your audience can follow. The structure should 
feel natural and make common sense to them! 

This section will walk you through various structural 
devices for organizing your presentation. Most presentations
that fail do so because of structural deficiencies.
When the structure works, the presentation works. If one 
is sound, the other will be sound. A good structure helps 
you work out the kinks and eliminate the extemporaneous. 



Make Sense 


The odds are high that you’ve been the victim of a meandering
presentation. Unorganized presentations follow
an invisible, neurotic pathway that only makes sense to 
the presenter. When an audience is unable to recognize 
structure, it's usually because the presenter either didn’t 
have time to organize the information or didn’t care 
enough to package the content in a way the audience 
could easily process. 

Presentations that follow rabbit trails lead nowhere and 
leave the audience lost in a confused maze of dead ends. 

Without structure, your ideas won’t be solid. Structure 
strengthens your thinking. But many presentations today 
migrate away from the purity and clarity of structure. 
Don’t fall for this temptation. 

The most widely used structure for presentations is topical.
A logic tree and outline are common forms to help
visualize structure: 

OUTLINE

Big Idea

I. _
    A. _
    B. _
    C. _
        1. _
        2. _
II. _
    A. _
        1. _
        2. _
        3. _
    B. _

TREE

[Figure: logic tree with supporting points branching beneath the big idea]

Notice how all the supporting information hangs off the
larger topics. Points are held together under one unifying 
big idea from which the topics cascade down. 

The chief marketing officer of a public company recently 
shared with me a process modification she made while 
developing messages for her CEO. Traditionally, she and 
her team would “pitch” ideas to the CEO by firing up a
slide show. About three slides in, he would inevitably 
throw a wrench in by letting them know that this or that 
piece of content should be included. If he’d held onto his 
shorts, he would have seen that his favorite pet content 
was there, but he wouldn’t have seen it for another 
fifteen minutes of slides. She laughed and said that the 
last time she worked with him, her team had a monumental
idea. Ditch the slides and give him a substantial
outline. He quickly absorbed the structure, saw his pet 
content immediately, and spent the bulk of the hour 
building on the ideas they proposed. Long live outlines! 

There are benefits to looking at a presentation’s structure
holistically.

• It creates a snapshot of the structure so you’re looking
at the whole and not the parts, which keeps you
focused on the construct instead of the details. 

• It ensures that you have one clear big idea bolstered 
by supporting topics. 

• It filters out tangential subtopics that may fall within 
the topic but that don’t purely support the single 
big idea. 

• It helps the review team get a quick read on the structure
and messages, saving them time so they can give
more thoughtful feedback. 














Organizational Structures 

There are several interesting ways to organize supporting 
content. Though the most common is topical, a presentation’s
structure can incorporate other less customary
organizational patterns. These patterns can be used as 
the overarching structure to replace a topical one, or to 
arrange content within a subtopic. 


These four structures have a natural storylike form that
creates interest in presentations:

• Chronological: Arrange information related to 
events according to their time progression (forward 
or backward). This is best used if a topic is generally 
understood in terms of when events transpired. 

• Sequential: Arrange information according to a 
process or step-by-step sequence. This is usually 
used in a report or to describe a project rollout. 

• Spatial: Arrange information according to how 
things relate together in a physical space. 

• Climactic: Arrange information in order of importance,
usually moving from the least to most
important point. 


These four structures have contrast inherently built
into them and work for persuasive presentations:

• Problem-solution: Arrange information by stating 
the problem and then the solution. Establishing that 
there's a problem helps convince people of the need 
for change. 

• Compare-contrast: Arrange information according 
to how two or more things are different from or 
similar to one another. Insights surface when information
is put into this context.

• Cause-effect: Arrange information to show the different
causes and effects of various situations. This is
effective when promoting action to solve a problem. 

• Advantage-disadvantage: Arrange information into
“good” or “bad” categories. This helps the audience
weigh both sides of an issue. 


Choose the organizational structure that makes the 
most sense for your message. Whichever structure you 
use, guide the audience through it with clear verbal or 
visual cues that clarify where you are and where you 
are taking them. 

Case Study: Richard Feynman

Gravity Lecture Structure 


Richard Feynman’s lectures at the California Institute of 
Technology appealed to both the heady physics majors 
and non-physics majors who simply dropped in to his 
class for fun (an unprecedented phenomenon for a physics
class). Feynman’s accessible communication style
earned him the title The Great Explainer.

In a BBC interview, Feynman explains how he organizes
his lectures: “How should I best teach them? ...[F]rom the
point of view from the history of science or the application
of science? My theory is...to be chaotic and confuse
it. Use every possible way of doing it. You catch this guy 
or that guy on different hooks as you go along. So during 
the time the fellow who is interested in history is being 
bored by the abstract mathematic...the fellow who likes 
the abstractions is being bored by the history. You do it 
so you don’t bore them all, all of the time.” 2 

Feynman is able to bring contrast to his lectures because 
he has both highly developed analytical and emotional 
sides. Even though he won a Nobel Prize, designed 
a pictorial scheme for subatomic particles, assisted in 
developing the atomic bomb, and predicted nanotechnology,
he also regularly performed on the bongo
drums. He believed his most prized asset was his insatiable
curiosity instilled by his father. “My father taught
me to notice things,” Feynman said. “I’m always looking,
like a child, for the wonders I know I’m going to find.” 3
Humor and curiosity are the emotions that Feynman 
draws on again and again to present a fascinating—and 
balanced—view of science. 

Feynman communicated from both his head and his 
heart in each lecture. 


Analytical Devices: 

• Signal: Feynman uses organizational signals to help 
the students understand how the structural pieces of 
a lecture fit together. He states the structure at the 
beginning and uses rhetorical questions and verbal 
signals when transitioning to new points. 

• Itemize: He breaks some sections into chunks by 
stating how many points he is going to make and 
then articulating what point he will be covering as 
his lectures progress. 

• Visualize: Feynman regularly used 35 mm slides, overheads,
and the chalkboard, but he didn’t overuse them.
He used dramatic gestures and sound effects to accompany
his lectures instead of blackboards covered with
esoteric symbols. 4 

Emotional Devices: 

• Wonderment: Feynman’s childlike curiosity drove him 
toward science while also influencing his lectures with 
poetic phrases of wonderment, not only for science 
but for life. Feynman didn’t just talk about physics; he 
marveled at the subject, and the magnificent beauty 
and brilliance of nature. 

• Humor: Feynman had a self-deprecating sense of 
humor and a knack for weaving in humor related to 
the subject matter. He knew that an entertaining story 
is often more readily received than a well-reasoned 
lecture. 5 He interjected humor in almost even increments
across his lectures.

The sparkline on pages 132 to 133 reflects Feynman’s
ability to employ the power of contrast.

Feynman’s Sparkline

[Sparkline axis labels: “What is” and “What could be”]


As you’ve learned, contrast is critical to holding an audience’s 
attention. Feynman’s lectures are a magnificent example of 
contrast and structure. Some academic topics simply can’t 
contrast between what is and what could be until they lay the 
foundation of what is over several lectures. 

In this lecture on the law of gravity, Feynman masterfully
incorporates contrast by moving back and forth between fact
(mathematics) and context (history) in nearly perfect
timing. Technically, this sparkline should be one flat
what is line. So we’ll pretend we’ve zoomed in on that
line to look more closely at the contrast between fact
and context. (See www for a visionary presentation by
Feynman that does traverse between what is and what
could be.)


Feynman uses carefully crafted phrases of wonderment that express his affection for the subject:
“This law has been called the greatest generalization achieved by the human mind. And
you can get already, from my introduction, that I’m interested not so much in the human mind
as in the marvel of nature, who can obey such an elegant and simple law as this law of gravitation.
So our main concentration will not be on how clever we are to have found it all out but
on how clever she is to pay attention to it!”



[Sparkline figure: a timeline marked at 0:10, 0:15, and 0:30, with tick-mark rows for Wonderment, Rhetorical Questions, Speaking, Laughter, and Organizational Signals]


Signal the Audience 

The tick marks above represent numerous signals of how the lecture was organized. Feynman uses three types of
organizational signals: 6

Introductions

“What I want to talk to you about...” “I am going to try to give...”
“Now I’ve chosen...” “What I’d like to do in this lecture.”

New Key Points

“First,...” “Next,...” “In the meantime,...” “The next point...”
“For instance,...” “Then,...” “Further,...” “In addition,...”
“The next question is,...” “Another problem came up,...” “Onward!”

Conclusions

“So it became apparent...” “So an interesting proposal is made...”
“But the most impressive fact is...” “Finally,...”

Create a Sense of Wonderment

“It’s one of the most beautiful
things in the sky—as good as 
sea waves and sunsets.” 


Make the Audience Think 

Feynman sprinkles rhetorical questions as
structural devices throughout his lecture,
like these: “Now what is this law of gravitation
that we’re going to talk about?
The force of the moon on the earth is balanced,
but by what? So something’s the
matter with the law?”


New Bliss 

“Nature uses only the 
longest threads to weave 
her patterns so that each 
small piece of her fabric 
reveals the organization 
of the entire tapestry.” 

Engage with Laughter

Feynman infuses his lectures with funny commentary to keep the students engaged. He
loses his place in his notes, stumbles a bit, and makes a joke at the same time: “Now that
shows that gravitation extends to the great distances, but Newton said that everything
attracted everything else. Do I attract you? Excuse me, I mean, do I attract you physically?
I didn’t mean that. What I mean is...”

Order Messages for Impact


Structure can be used to drive a desired outcome. Where and how you associate one piece of information with 
another creates meaning and determines how others will receive it. Skillfully arranged information creates emotional 
appeal and leads to the desired emotional impact at the end of the presentation. 

Below is an example of a third-quarter update presentation. Most organizations regularly deliver these reports to 
communicate the progress being made toward corporate goals. Notice that the “move to” states that the employees
should feel confident and motivated to help. 


BIG IDEA

Q3 revenue is down, and we’re still in the lead, but if we slow down, we’ll lose market share

MOVE FROM

Unsure about the company’s future
Financial distractions breeding low productivity

MOVE TO

Confident we will succeed
Motivated to create even better products next quarter


Demotivating Structure 

This structure does not motivate the audience to feel confident that they will succeed. 


[Figure: sticky notes showing the order of the messages in the script below]


Script 

Welcome, everybody, to the Q3 update. I just want to let you know that the
Q3 revenue was down. The rumors are true.

The numbers are down. But hey, we’re up 15 percent in number of new
clients. That’s good. Good job, everyone.

Our market share is also up, so that’s not bad.

And you guys were able to turn out some new products this quarter, and I’m
really proud of you for that.

We’re not doing too bad compared to our competitors.

All this happened in a quarter in which the analysts predicted we’d be down,
so it was expected. Thanks for coming today, and have a great day.

Motivating Structure

Now look at the same material presented in a different order with a pinch of added emotional appeal. Simple 
structural shifts and celebratory marvel changed the presentation’s tone and outcome. Each point builds on 
the previous one, and it culminates in a motivating crescendo. 

Script

Welcome, everybody, to the Q3 update. When the forecasters looked at this
quarter, they said our industry, and our company in particular, was the
little engine that couldn't. They said we wouldn't be able to make the climb.

In spite of that, we have shaken up the market in a down economy! Our
new-client wins are up 15 percent over last year. In fact, four of the new
clients are large multinational organizations that have been on our target
list for over three years!

Yes, revenue is down, but let’s give that context: The economy is down; our
industry tracks with the economy, and so it’s down; our company is a leader
in our industry and tracks with it, so of course our revenue would be down.

But how did we do compared to our competitors? SuperCo is down 12 percent.
DuperCo is down 8 percent. How far down are we? <pause> We are down only
2 percent.

So how has that impacted our market share? We have made significant gains,
not only domestically but also abroad.

Even though the marketplace has endured a season of chaos and uncertainty,
you have made this season one of my proudest moments.

Just look at the products we’ll be rolling out in Q4. Wow, aren’t they
beautiful? It takes innovation and tenacity to create stunning products with
this magnitude of market disruption, and you did it! If you can be this
creative in an uncertain environment, I can’t wait to see what you’ll do
when the market turns around. We’re not only the engine that could; we’re an
engine that can’t be stopped!


The way information is structured makes 
a difference in the outcome. 

Create Emotional Contrast


Audiences enjoy it when presentations convey emotional contrast and appeal; however, most presentations
lack this because it requires an additional step and can be an elusive element to include.

Involving the audience emotionally helps them form a relationship with you and your message. 
According to Peter Guber, “Business leaders must recognize that how the audience physically 
responds to the storyteller is an integral part of the story and its telling. Communal emotional 
response—hoots of laughter, shrieks of fear, gasps of dismay, cries of anger—is a binding force that 
the storyteller must learn how to orchestrate through appeals to the senses and the emotions.” 7 

Moving between analytical and emotional content is another form of contrast. Remember, contrast 
is very important for keeping the audience interested. Switching between the two creates contrast. 


Presentation Content Types 

Below are two columns listing typical presentation content. Hard drives around the world are 
packed with slides from the left-hand section, but only a tiny percentage have slides from the 
right-hand section. 


ANALYTICAL CONTENT

Diagram
Feature
Data
Evidence
Example
Case study
Specimen, exhibit
System
Process
Facts
Supporting documentation

EMOTIONAL CONTENT

Biographical or fictitious stories
Benefits
Analogies, metaphors, anecdotes, parables
Props or dramatization
Suspenseful reveals
Shocking or scary statements
Evocative images
Invitations to marvel or wonder
Humor
Surprises
Offers, deals

Look at any of the analytical topics from the list on the left. They typically have no emotional 
charge to them—neither pain nor pleasure. Yet all could be presented in a way that transforms 
traditionally analytical material into emotional material. For example, a simple diagram of a small 
circle within a larger circle could convey that an acquisition occurred. The diagram is neutral until 
you tell the story of the struggle it took to acquire the company, or the heroics displayed by both 
parties to expedite the acquisition. Data is purely analytical until you explain why the ups and 
downs exist. 

Contrast Analytical and Emotional Content

Let's review the Q3 update presentation from the previous pages once again. A typical quarterly update presentation 
is full of data and reportlike material that isn’t likely to connect employees to the message. 


Here’s how the analytical information was modified in the earlier example: 


METAPHOR
The little engine that couldn’t.

SLOW REVEAL
Suspenseful progression and long pause.

INVITATION TO MARVEL
Wonderment: “Aren’t they beautiful?”


Inventory your slides and identify any content that can 
be transformed from analytical to emotional. Change it 
wherever appropriate. 

In the movies, alternating emotion is called beats. Beats
are the smallest structural element in a movie; there
can be several in one scene. Scenes are analyzed to
make sure there is a shift of emotion in each scene.
Screenwriters carefully ensure that the emotions are
moving between pain and pleasure so that the audience
remains engaged. 8

Moving back and forth between analytical and emotional 
content engages presentation audiences in the same way. 

Contrast the Delivery


The chronic bombardment of media and entertainment
has transformed us into an impatient culture. The entertainment
industry continues to churn out new, innovative
ways to engross our minds and hearts and provide us 
with various avenues of escape. 

Audiences have become accustomed to quick action, 
rapid scene changes, and soundtracks that make the 
heart race. These advances in entertainment have set 
high expectations for visual and visceral stimulation 
and have undermined our ability to sit attentively for an 
hour while a speaker drones on. Most squirm within ten 
minutes and wish they had a remote control to flip to 
something more interesting. 

Changing up delivery methods from traditional slide
read-along to less conventional means keeps the audience
interested and creates an element of surprise. Use
alternate media, multiple presenters, and interaction
to keep your talk alive, but be aware that these mode
changes need to be carefully planned. Several can and
should occur within an hour.

The key to getting and holding attention is having
something new happen continually. This creates a sense
that something is always “going on.” Changing delivery
modes can include physical movement on the stage.
People feel compelled to watch visual events carefully
because of our natural fight-or-flight instinct. Changes
in media, alternating presenters, or even something as
simple as a dramatic gesture create variety for the
audience and hold their interest.

Overuse of slides diminishes the power of human connection.
Because genuine human connection is rare,
you should capitalize on moments when you're presenting
in person. An audience will deem a presentation a
success if they feel they interacted with you. Lowering 
your dependency on slides helps facilitate this sense 
of connectedness. 


Varying the delivery between traditional and less traditional
methods creates contrast. Below is a list of contrasting delivery
methods. You can see how nontraditional methods
make a presentation more interesting.

TRADITIONAL NONTRADITIONAL 


Stage 

Be the main event • Share the main event 
Hide behind podium • Be free to roam 

Use stage as-is • Use stage as a setting 

Style 

Serious business tone • Humor and enthusiasm 
Confined expressiveness • Large expressiveness 
Monotone • Vocal and pace variety 

Visuals 

Read slides • Minimize slides 
Static images • Moving images 
Talk about your product • Show them your product 

Interaction 

Minimize disruptions • Plan disruptions 
Resist live feedback • Embrace real-time feedback 
Request silence • Encourage exchanges 

Content 

Familiarity with features • Wonderment and awe at features 
Flawless knowledge • Self-deprecating humanness 
Long-winded rambles • Memorable, headline-sized sound bites 

Involvement 

One-way delivery • Polling, shout-outs, game playing, 
writing, drawing, sharing, singing, 
and question-asking 


Use as many variations as possible to keep it interesting. Mix it up 
to create contrast! 

Putting Your Story on the Silver Screen


You’re finally at the last step of the presentation creation process. Now 
that all your messages are clear and structured, it’s time to storyboard 
the slides. 

Before opening presentation software, keep in mind the following: 

One idea per slide: Each slide should have only one message. There’s 
no reason to crowd several ideas onto one slide. Slides are free; make 
as many slides as you need. Give each idea its own moment on the stage. 
The audience visually re-engages each time you advance to the next 
slide, so having several well-paced slides will re-lure them visually each 
time you click. 

Keep it simple: Sketch out small visual representations of your ideas on 
paper or sticky notes. Constraining your ideas to a small sketch space 
guides you to simple, clear words and pictures (as proof of concept) 
before creating them in presentation software. Even if you don’t have 
an image, nice big type on the screen is better than dense prose. 












Turn words into pictures: Turning words into pictures is easy if you understand
the relationship between the words on the slide. Look at one of your
bullet slides. Each piece of content has some sort of relationship with the 
other content because when you were assembling the slide, it “felt” like 
they belonged together. Circle all the verbs or nouns on the slide and think 
through how they are all related to each other. Odds are, the relationships 
they form fall into one of the categories below. 


Various Types of Visual Relationships 9 


FLOW 

Shows process 


STRUCTURE 
Shows classification 


CLUSTER 

Shows groupings 


RADIATE 

Shows links and nodes 


INFLUENCE 

Shows cause and effect 






Circle either the verbs or nouns in 
the bullet points and determine 
their relationship to each other. 


Note: If you want insights into how to create slides, pick up these two books: Presentation Zen by Garr Reynolds and slide:ology by yours truly.

Process Recap


If you’ve been using sticky notes to collect and 
organize your ideas over the last two chapters, 
this is what the process should look like. 


GENERATE IDEAS
page 98 to 117

FILTER DOWN
page 118 to 119

CLUSTER
Cluster ideas by topic.
page 120 to 121

CREATE MESSAGES
Turn topics into charged messages in the form of a sentence.
page 120 to 121

ARRANGE MESSAGES
Place messages in an order that creates the most impact.
page 126 to 134

ADD SUPPORTING POINTS
Each message needs supporting evidence in the form of slides.
page 128 to 129

STRENGTHEN THE TURNING POINTS (TP)
Get your acts together! Ensure you have a clear beginning, middle, and end
with strong turning points.
page 38 to 39
page 42 to 45

VERIFY CONTRAST
Validate the content contour, emotional contrast, and delivery contrast.
page 46 to 47
page 136 to 137

Once the message and the structure are final, turn the words into pictures.
page 140 to 141

Everything has inherent structure. A leaf, a building, and even ice cream
each have a (molecular) structure. Structure drives the shape and expression
of everything. The same is true for presentations. How they are structured
determines how they are perceived. Changes to the structure,
whether grand or small, alter the receptivity of the content.

To validate the structure, pull your presentation out of the linear slide-making
environment and look at the structure spatially and holistically
to ensure that it is sound, then arrange the flow for greatest impact.

Structure allows your audience to follow your thought process. If you 
don’t have clear structure then you end up jumping around and making 
random connections to ideas that are unclear to the audience. Solid 
structure causes ideas to flow logically and helps the audience see how 
the points connect to each other. 

Create a S.T.A.R. Moment


Create a moment where you dramatically drive the big
idea home by intentionally placing Something They’ll
Always Remember (a S.T.A.R. moment) in each presentation.
This moment should be so profound or so
dramatic that it becomes what the audience chats about 
at the watercooler or appears as the headline of a news 
article. Planting a S.T.A.R. moment in a presentation 
keeps the conversation going even after it’s over and 
helps the message go viral. 

Since you might be presenting to an audience that sees
lots of presentations, such as a venture capitalist or a customer
who is reviewing several vendors, you want to
stand out two weeks after you presented, when they're
making their final decision. You want them to remember 
YOU instead of all the other presenters they encountered. 

The S.T.A.R. moment should be a significant, sincere, 
and enlightening moment during the presentation that 
helps magnify your big idea—not distract from it. 

There are five types of S.T.A.R. moments: 

• Memorable Dramatization: Small dramatizations convey
insights. They can be as simple as a prop or demo,
or something more dramatic, like a reenactment or skit.

• Repeatable Sound Bites: Small, repeatable sound
bites help feed the press with headlines, populate and
energize social media channels with insights, and give
employees a rally cry.

• Evocative Visuals: A picture really is worth a thousand
words, and a thousand emotions. A compelling
image can become an unforgettable emotional link to 
your information. 

• Emotive Storytelling: Stories package information in 
a way that people remember. Attaching a great story 
to the big idea makes it easily repeatable beyond 
the presentation. 

• Shocking Statistics: If statistics are shocking, don’t 
gloss over them; draw attention to them. 

The S.T.A.R. moment shouldn’t be kitschy or cliche. 
Make sure it’s worthwhile and appropriate, or it could 
end up coming off like a really bad summer camp skit. 
Know your audience and determine what will resonate 
best with them. Don’t create something that’s overly 
emotionally charged for an audience of biochemists. 

S.T.A.R. moments create a hook in the audience’s 
minds and hearts. They tend to be visual in nature and 
give the audience insights that supplement solely auditory
information.


Famous S.T.A.R. Moments 



RICHARD FEYNMAN 

Richard Feynman helped investigate 
the space shuttle Challenger disaster. 
He quickly identified the failure of a 
crucial O-ring as the probable cause 
of the explosion. To illustrate his point, 
he bent and clamped a piece of the 
rubber O-ring and secretly placed it 
in a cup of ice water. At a perfectly 
timed moment, he loosened the clamp 
and as the rubber slowly uncurled he 
said, “...[F]or more than a few seconds, 
there is no resilience in this particular 
material when it is at a temperature 
of 32 degrees.” 1 The press went nuts
because it should have expanded in a
millisecond.



BILL GATES 

Through his philanthropy, Bill
Gates hopes to solve some of the
world’s biggest problems, including
malaria. In his 2009 TED talk,
Gates established the gravity of this 
disease by stating that millions have 
died, and 200 million people are 
suffering from it at any given time. 
He then stated that more money is 
spent developing baldness drugs 
on behalf of wealthy men than on 
fighting malaria for the poor. At 
that moment, he released a jar of 
mosquitoes into the room saying, 
“There's no reason only poor people
should have the experience.” 2



STEVE JOBS 

Steve Jobs is a master at unveiling 
Apple products in intriguing ways. 
“This is the MacBook Air,” he said in
January 2008, “so thin it even fits 
inside one of those envelopes you 
see floating around the office.” With 
that, Jobs walked to the side of the 
stage, picked up one such envelope, 
and pulled out a MacBook Air. The 
audience went wild as the sound of 
hundreds of cameras clicking and 
flashing filled the auditorium. “You 
can get a feel for how thin it is. It has 
a full-size keyboard and full-size display.
Isn’t it amazing? It’s the world’s
thinnest notebook,” said Jobs. 3 

Case Study: Michael Pollan

Memorable Dramatization 


Michael Pollan is a natural storyteller who teaches people 
where food comes from. His books, The Omnivore's Dilemma 
and In Defense of Food, have reshaped how Americans think
about the current food system.

When Pollan spoke at Pop!Tech in the fall of 2009, there
was one point in particular where he wanted to leave a deep 
impression on the audience. He and his team had calculated 
how much crude oil it takes to create a fast food double 
cheeseburger. It was a staggering amount, and he wanted 
that message to stick. 

When he was introduced at the beginning of his presentation, 
Pollan walked on stage carrying a paper bag from a 
fast food chain. “A little something for later,” he said. He 
placed it on a table in the middle of the stage and started 
his presentation—thereby leaving the audience in suspense 
about the prop on the table. 

Later, when Pollan was drawing connections between oil 
and the food supply, he said, “I want to show you how much 
oil goes into producing this [cheeseburger].” He pulled out 
the burger from the paper bag. Then he pulled out an empty 
eight-ounce glass and a container full of oil. He filled the glass 
with oil. “But that's not all. You need another eight ounces.” 
He reached under the table and pulled out a second glass. 
Then he did it again. And again. In all, it took twenty-six 
ounces of oil to produce one double cheeseburger. 

Showing the audience the burger next to the crude oil used 
to produce it was a disturbing visual—one that the audience 
would almost certainly remember the next time they made 
food choices. 

Repeatable Sound Bites 


If people can easily recall, repeat, and transfer your message, 
you did a great job conveying it. To achieve this, you 
should have a handful of succinct, clear, and repeatable 
sound bites planted in your presentation that people can 
effortlessly remember. 

A thoroughly considered sound bite can create a S.T.A.R. 
moment—not only for the people present in the audience 
but also for the ones who will encounter your presentation 
through broadcast or social media channels. 

• Press: Coordinate key phrases in your talk with the 
same language in the press release. Repeating critical 
messages verbatim ensures that the press will pick up 
the right sound bites. The same is true for any camera 
crews that might be filming your presentation. Make 
sure you have at least a fifteen- to thirty-second message 
that is so salient it will be obvious to the reporter 
that it should be featured in the broadcast. 

• Social Media: Create crisp messages. Picture each 
person in the audience as a little radio tower empowered 
to repeat your key concepts over and over. Some 
of the most innocent-looking audience members have 
fifty thousand followers in their social networks. When 
one sound bite is sent to their followers, it can get re-sent 
hundreds of thousands of times. 

• Rally Cry: Craft a small, repeatable phrase that can 
become the slogan and rallying cry of the masses trying 
to promote your idea. President Obama’s campaign slogan, 
“Yes We Can,” originated from a speech during the 
primary elections. 

Take time to carefully craft a few messages with catchy 
words. For example, Neil Armstrong used the six hours 
and forty minutes between his moon landing and first 
step to craft his statement. Phrases that have historical 
significance or become headlines don’t just magically 
appear in the moment; they are mindfully planned. 

Once you’ve crafted the message, there are three ways 
to ensure the audience remembers it: First, repeating the 
phrase more than once. Second, punctuating it with a 
pause that gives the audience time to write down exactly 
what you said. And finally, projecting the words on a slide 
so they receive the message visually as well as orally. 


BELOW ARE A FEW RHETORICAL DEVICES THAT CREATE A MEMORABLE SOUND BITE. 


• Imitate a famous phrase: Golden Rule: Do unto others 
as you would have them do unto you. 

Imitation: Never give a presentation you wouldn’t 
want to sit through yourself. 

• Repeat words at the beginning of a series: “It was 
the best of times, it was the worst of times, it was the 
age of wisdom, it was the age of foolishness...” 

Charles Dickens, A Tale of Two Cities 


• Repeat words in the middle of a series: “We are troubled 
on every side, yet not distressed; we are perplexed, 
but not in despair; persecuted, but not forsaken; cast 
down, but not destroyed...” 

Apostle Paul to the Corinthians 

• Repeat words at the end of a series: “...and that government 
of the people, by the people, for the people 
shall not perish from the earth.” 

Abraham Lincoln, Gettysburg Address 

“Never in the field of 
human conflict was 
so much owed by 
so many to so few.” 

Winston Churchill 


“That’s one small step 
for (a)* man, one giant 
leap for mankind.” 

Neil Armstrong 


“Mr. Gorbachev, tear 
down this wall.” 


Ronald Reagan 


“If it doesn’t fit, you 
must acquit.” 


Johnnie Cochran 


*When Armstrong composed this phrase, it included an “a.” Critics thought he had botched the line, 
but recent analysis of the recording shows evidence that he spoke the word and that it was 
dropped in transmission. 4 


“I float like a butterfly 
and sting like a bee.” 

Muhammad Ali 

Evocative Visuals 


Images can evoke the full range of emotion, from pain 
to pleasure. While using eloquent, descriptive words is 
one way to create an image, a photograph or illustration 
can frequently leave a more vivid imprint in the 
audience’s hearts and minds. When the human mind 
recalls an image, it also recalls the emotion associated 
with the image. 

Your presentation can use one large full-screen image 
to convey a point, or pair images together to create 
conflicted emotions like the examples on the right. 

Two recent occasions were publicized on an international 
scale through images of ink-stained fingers. In 
one, fingers were stained to prevent double voting. In 
the other, fingers were stained to tyrannically enforce 
voting. Each evoked very different emotions. 


January 30, 2005: Iraqis voted for the first time since the 
fall of Saddam Hussein. Militants tried to stop the voting 
by setting off dozens of explosives that shook Baghdad. 
Proud citizens raised their purple digits (showing they 
had voted) as gestures of support for democracy and in 
defiance of terrorist threats. 

June 27, 2008: After Robert Mugabe was defeated in 
Zimbabwe’s presidential elections, he mandated a runoff 
ballot on which he was the only candidate and 
resolved to hold onto power through fraud, corruption, 
and intimidation. Voters in Zimbabwe were required to 
show their ink-stained fingers to prove they had voted. 
If they didn’t, they could be beaten and forced to vote, 
and they would face severe consequences at the hands 
of government agents. 



It’s effective to recall real events like the stories above, 
but using images often conveys emotional force that 
words cannot match—particularly when abstract issues 
like democracy and tyranny are involved. 



Conservation International uses 
dreamlike images of the ocean 
juxtaposed with rubbish washed up on 
the beach. The contrast is jarring and 
compels the audience to understand 
why the oceans are so important and to be 
ready to take action to improve policy, 
change business practices, and make 
better choices in their daily lives. 

Zimbabwe woman 
feeling scared, intimidated, 
and defeated. 


The gesture and ink 
are similar but have 
entirely different 
emotional meanings. 


Iraqi women feeling 
joyful, free, and defiant. 

Case Study: Pastor John Ortberg 

Emotive Storytelling 


Storytelling creates the emotional glue that connects 
an audience to your idea. Creating unique, inspirational 
messages every week is demanding, and Pastor John 
Ortberg of Menlo Park Presbyterian relies heavily on his 
own life stories to illustrate his messages. 

Ortberg’s ability to weave stories into his messages is a 
big part of his trademark style and appeal. He spends as 
much time as he needs to word-craft and story-craft his 
messages together like a tapestry. He develops a master 
theme supported by Scripture, then carefully weaves 
personal stories throughout. It’s very similar to the woof 
and warp of a loom. The master theme and Scripture 
hold the latitudinal messages together, and the stories 
are like the yarn that shuttles back and forth, creating 
patterns in the fabric. 

The sermon analyzed on the next few pages was the 
first I heard Ortberg deliver. I was intrigued by its 
structure and its ability to move me. The master theme 
was “people can bring the Kingdom of Heaven to this 
Earth by showing love.” He sprinkled several stories 
through the sermon, but there was one master story 
that was referenced and carefully woven throughout: 
that of his sister’s rag doll, Pandy. After telling the rag 
doll story (below) at the beginning, he continued to 
use it like glue with references to raggedness throughout 
the sermon. 

The master story conveyed that people want to be 
loved in spite of their ragged condition: 

“Now Pandy had lost most of her hair, one of her eyes, 
and one of her arms. But she was still my sister Barbie’s 
favorite doll. She was not a very valuable doll. I don’t 
think we could have given her away. She was not a very 
attractive doll. In fact, she was kind of a mess. But in 
the way that little kids do, for reasons that no one 
could quite understand, Barbie loved that little rag 
doll. So when Barbie ate, Pandy was next to her; when 
Barbie slept, Pandy was next to her; when Barbie took 
a bath, Pandy was next to her. Love Barbie, love her rag 
doll—it’s a package deal. Other dolls came and went. 
Pandy was family. 

“This is how strong that love went. One time, we took a 
vacation from Rockford, Illinois, up to Canada; and, of 
course, Pandy went with us. When we came back home, 
we realized Pandy had not made the return trip. Pandy 
had stayed in the hotel back in Canada. No other option 
was thinkable. My father turned the car around, and we 
drove from Rockford to Canada to get that doll because 
we were a devoted family. Not a very bright family, really, 
but a devoted family. And we tracked Pandy down. 

“Pandy had never been worth very much to start with. 
By now, she was so disfigured that the only logical 
thing to do was to trash her. Get rid of her. But Barbie 
loved that doll with a love that made that doll precious 
to anybody that loved Barbie. Love Barbie, love her rag 
doll. It’s a package deal. 

“She did not love Pandy because Pandy was beautiful. 
She loved her with a love that made Pandy beautiful.” 

Ortberg ended his sermon by coming back to the premise 
of the opening story. Returning the congregation to the 
opening narrative took them to where they started with 
new, enlightened insights that made the story more 
meaningful and complete. 

Ortberg’s Sparkline 

[Figure axis labels: What is / What could be] 


Establish What Could Be 

After telling the rag doll story, he equates it with how 
human love works on Earth versus the way heavenly 
love works on Earth. “There is a kind of love that seeks 
value in what is loved. There is a kind of love that is 
drawn to its object or a person because that person is 
attractive or that object is expensive or is important 
or can give me status or make me feel good. There is a 
kind of love that seeks value in what is loved, and there 
is a kind of love that creates value in what is loved.” 


Repeat the Theme 

Ortberg jars the congregation a second time with the rag 
doll theme, saying that if you love God, you have to love 
His rag dolls, because nobody is perfect. “Jesus just kind 
of has one request. Christian faith is not real complex. We 
make it so complicated. It ain’t rocket science. John puts 
it like this: ‘...[S]ince God so loved us, we ought also to 
love one another.’ Jesus says, ‘Love me, love my rag dolls.’ 
It’s a package deal. You can’t have one without the other.” 

Ragged Theme 

Ortberg uses phrases 
that emphasize concepts 
from the rag doll story 
to make the point that 
people are broken but 
still lovely and lovable. 



Kingdom Theme 

Ortberg uses the Kingdom 
as a master theme. Several 
times he contrasts how 
people love on Earth with the 
type of love expressed in 
the Kingdom of God. 



Big Idea 

Ortberg weaves story and Scripture 
together to convey his message 
but is careful to continually repeat 
his idea throughout his sermon. He 
brings the congregation back to 
the theme of love: “Wanna know 
how to break God’s heart? Just 
don’t love someone.” 



Call to Action 

Ortberg concludes by convincing congregants that someone's value is determined 
by how loved they are. So he challenges them to call someone they hadn’t yet told 
“I love you.” “There is a kind of love that looks for value in what is loved, that looks 
for what is shiny and rich and impressive; but there is a kind of love that takes rag 
dolls and makes up there come down here ... Maybe you are aware right now there 
is somebody in your life that needs to hear you say, ‘I love you.’ Before the day is 
done, you need to look someone in the eyes, or you need to pick up the phone or 
pick up a pen. There are some words you need to say.” 





Emotional Moments 

Ortberg chokes up 
twice during his sermon. 
Once when he repeats 
verses from an old song 
and again at the end 
of the sermon as he 
conveys the magnitude 
of what he’s asking the 
congregation to do. 

Case Study: Rauch Foundation 

Shocking Statistics 


In 2002, a small group of Long Island’s civic, academic, 
labor, and business leaders gathered to discuss challenges 
facing the region and its potential for new 
directions. As a result of those meetings, The Rauch 
Foundation funded the Long Island Index to gather and 
publish data on the Long Island region. Their operating 
principle was “Good information presented in a neutral 
manner can move policy.” The goal was to be a catalyst 
for action by engaging the community in thinking about 
its future from a regional perspective. 

Even though the Long Island Index had served up 
valuable data about the past and present, the hope of 
driving action to improve the future hadn’t gained 
much traction. 

The local Long Island newspaper, Newsday, reported, 
“Last year, Index founder Nancy Rauch Douzinas challenged 
people to adopt a let’s-do-something-about-this 
attitude. But the attitude, like action, hasn't materialized. 
So the Index is adopting an attitude of its own. It still will 
present data neutrally, and it won't take sides, but it will 
be much more active in trying to make sure that its ideas 
and its sense of urgency don’t end when the lights come 
on after the annual presentation.” 5 


So at the 2010 press launch of the Index, The Rauch 
Foundation pulled out key statistics and incorporated 
that information into a presentation. Dramatizing the 
key statistics with images helped convey the inventiveness 
and sense of urgency that would be required to 
manage growth with better environmental outcomes. 
Titled The Clock is Ticking, this four-and-a-half-minute 
presentation showed one image after another to drive 
home the idea that Long Island is in steady decline and 
must do something—right now! 

“For seven years, the Long Island Index produced many 
reports filled with facts and figures that told people 
how poorly our region was faring. When we shifted to 
telling the story visually, the reaction was electric. The 
information was the same, but the new format communicated 
the issues with an emotional urgency. The visual 
story moved citizens and elected officials to address the 
problems with an understanding that there was no more 
time to lose.” 

Nancy Douzinas 
President, Rauch Foundation 

Case Study: Steve Jobs 

MacWorld 2007 iPhone Launch 


Steve Jobs has the uncanny ability to make audience 
engagement appear simple and natural. His presentations 
compel an audience’s undivided attention for 
an hour and a half or more—something that very few 
presenters can do. 

“Steve Jobs does not deliver a presentation. He offers 
an experience.” 


Carmine Gallo 6 

Jobs's reputation for marketing brilliance already has the 
audience coming in to the presentation in a frenzied 
state of excitement, and he brilliantly keeps them there 
with dramatic suspense and an intriguing delivery. This is 
an uncommon skill for a CEO, or anyone, for that matter. 

Jobs purposefully builds anticipation into each of his presentations—which 
have been described as an “incredibly 
complex and sophisticated blend of sales pitch, product 
demonstration, and corporate cheerleading, with a dash 
of religious revival thrown in for good measure.” 7 Over the 
years, he has used every type of S.T.A.R. moment. Below 
are four from his 2007 iPhone launch presentation. 

• Repeatable Sound Bites: During the keynote address, 
Jobs used the phrase “reinvent the phone” five times, 
the same phrase that Apple used in their press release. 
After walking through the phone’s features, he hammered 
it home once again: “I think when you have a 
chance to get your hands on it, you'll agree; we have 
reinvented the phone.” The next day, PC World ran a 
headline stating that Apple would “reinvent the phone.” 8 


• Shocking Statistics: Jobs didn’t just state a large number; 
he put the scale of that number into a context the 
audience would understand. “We are selling over five 
million songs a day now. Isn’t that unbelievable? Five 
million songs a day! That’s fifty-eight songs every second 
of every minute of every hour of every day.” 

• Evocative Visuals: The audience 
laughed when he said, “Today 
Apple is going to reinvent the 
phone, and here it is....” He then 
showed an iPod faked-up to look 
like it had an old rotary dial on it 
to tease the audience. 

• Memorable Dramatization: In the past, Jobs had pulled 
an iPod out of his coin pocket and removed a MacBook 
Air from an interoffice envelope. For this launch, a feature 
of the product itself created the dramatic moment. 
The new interface was so revolutionary that the audience 
gasped the first time he used the scrolling feature. 
Later, Jobs said, “I was giving a demo to somebody 
a little while ago at Apple. I finished the demo and I 
said, ‘What do you think?’ He told me this: ‘You had 
me at scrolling.’” 

Notice on pages 164 to 165 how the bulk of his presentation 
centers on what could be. Not many presenters can 
sustain the momentum there, yet he keeps interest with a 
tightly rehearsed demo that showcases the revolutionary 
new features and demonstrates them in humorous and 
unexpected ways. See page 139 for a master list of ways 
to deliver contrast. Jobs incorporates many of these in 
his presentations too. 



Note: Duarte Design does not work with Steve Jobs. This example was chosen for its 
historical significance as one of the greatest product launch presentations of all time. 

Jobs’s Sparkline 

[Figure axis labels: What is / What could be] 


Establish What Could Be 

“This is a day I've been looking forward to for two and a 
half years. Every once in a while, a revolutionary product 
comes along that changes everything.... Today we’re introducing 
three revolutionary products of this class. The first 
one is a widescreen iPod with touch controls. The second 
is a revolutionary mobile phone, and the third is the breakthrough 
Internet communications device. So three things: A 
widescreen iPod with touch controls, a revolutionary mobile 
phone, and a breakthrough Internet communications device. 
An iPod, a phone, and an Internet communicator. An iPod, 
a phone...are you getting it? These are not three separate 
devices. This is one device. And we are calling it iPhone.” 


Lure with Suspense 

Jobs has a magical sense for 
creating suspense. For fifteen 
minutes, he reviews the hardware 
features of the iPhone by 
clicking through photos of the 
device while it is turned off. Yes, 
off! When he finally powers up 
the iPhone and demonstrates 
the scrolling feature for the first 
time, the audience gasps and 
breaks into roaring applause. 







[Figure: Jobs’s sparkline timeline. Time axis ticks: 0:00, 0:10, 0:20, 0:30, 0:40. Legend: Video, Demo Product, Guest Speaker, Speaking, S.T.A.R. Moment. Audience-response rows: Laughter, Clap, Marvel.] 


Establish What Is 

Jobs sets up what is in perfect form. He gives an 
update on the market and performance of several 
products: Intel transition, retail stores, iPod, iTunes, 
and Apple TV. He demos the recently released 
Apple TV. 


Create Contrast 

Jobs comes back down to 
what is a few times in the 
speech by comparing the 
iPhone features with current 
products on the market that 
amplify the magnitude of 
this breakthrough. 

Keep Them Engaged 

When Jobs demos the new features, he 
doesn’t merely go through a checklist of 
the features—he plans clever scenarios. 
Every thirty seconds or so, he shows a 
new feature by completing a task the way 
a real user would. He makes phone calls to 
a colleague while another colleague calls 
him; he checks his visual voicemail and 
plays a message from Al Gore congratulating 
him on the launch; he calls Starbucks to 
order four thousand lattes to go. He varies 
the tasks in his demo forty-seven times to 
make it a riveting demonstration. 


The New Bliss 

Jobs ends his presentation having enthusiastically moved his 
audience from what is to what could be. But he doesn't stop there. 
He reminds them of Apple's revolutionary product heritage and 
assures them that they’ll do this again. His ending sets the stage for 
a new beginning. “I didn’t sleep a wink last night. I was so excited 
about today because we’ve been so lucky at Apple. We’ve had 
some real revolutionary products. The Mac in 1984 is an experience 
that those of us that were there will never forget, and I don’t think 
the world will forget it either. The iPod in 2001 changed everything 
about music. We’re going to do it again with the iPhone in 2007. 
We’re very excited about this. There’s an old Wayne Gretzky quote 
that I love: ‘I skate to where the puck is going to be, not where it 
has been.’ We’ve always tried to do that at Apple since the very, 
very beginning, and we always will. Thank you very, very much.” 

Make Them Marvel 

Jobs creates a sense of wonder by 
interjecting phrases that invite the 
audience to marvel at the product. 
A few examples of the language 
he uses: “This is a revolution of the 
first order—to really bring the real 
Internet to your phone!... Isn’t that 
great? ... So we think this is pretty 
cool. ...We’ve designed something 
wonderful for your hand, just wonderful. 
... It’s pretty awesome.” 


Invite Guest Speakers 

Jobs invited three 
partners to present. 
The first two breezed 
through their parts 
but the Cingular/AT&T 
CEO read through cue 
cards, repeated what 
was already said, and 
rambled way longer 
than he should have. 
Too bad. 


Be Flexible 

When the clicker stops working, he 
pauses, smiles, and fills the time it 
takes to fix it with a funny story about 
how he and Steve Wozniak used a TV 
jammer as a prank on unsuspecting 
college students when they were in 
high school. Carmine Gallo said, “In 
this one-minute story, Jobs revealed a 
side of his personality that few people 
get to see. It made him more human, 
engaging, and natural. He also never 
got flustered.” 9 

It’s a great feeling to deliver a presentation after which everyone is buzzing 
with excitement at the watercooler; or your presentation is splashed on 
front-page news; or social media sites pick it up—and suddenly millions have 
seen your presentation. 

The presentations that get repeated have memorable moments in them. 
These moments don't happen on their own; they are rehearsed and planned 
to have just the right amount of analytical and emotional appeal to engage 
both the minds and hearts of an audience. 

Captivate your audience by planning a moment in your presentation that 
gives them something they’ll always remember. 

Amplify the Signal, Minimize the Noise 


A presentation broadcasts information to an audience 
in much the same way that a radio broadcasts programming 
to listeners. Thus, the signal’s strength and 
clarity determine how well information is conveyed to 
its intended recipients. Communication is a complex 
process that has many points at which the signal can 
break down. Once a message has left its sender, it is 
susceptible to interference and noise, which can cloud 
its intention and compromise the recipient’s ability to 
discern the meaning. 

Communication has the following parts: sender, transmission, 
reception, receiver, and noise. The message can 
become distorted at any step of this process. Your top 
priority is to ensure that the message-carrying signal is 
free from as much noise or interference as possible. 

Presentation development works the same way. Every 
step of the process either enhances the signal or creates 
noise that causes the audience to tune out. 

My own high-tech career began in 1984 selling custom 
high-frequency cable assemblies. Each cable was custom 
engineered to meet an extensive list of specifications. 
The task of every engineer and plant employee 
was to ensure that each step in the manufacturing 
process reduced the noise margin and protected the 
signal’s quality. We tested raw materials, insulated wire 
with advanced materials, and produced gold-plated terminators. 
We fussed over everything at each stage and 
then tested everything before it was shipped. If it didn’t 
fall within a tight impedance tolerance, we couldn’t ship 
it, because it wouldn’t work for the client’s application. 
One small error would render the cable useless. 

The same is true for creating a great presentation. The 
signal-to-noise ratio is an important factor in how well 
your message is received, and it’s your job to minimize 
the noise. If the audience receives a message that includes 
any interference, they receive distorted information. You 
must expend energy minimizing the noise in each step of 
the communication process to ensure that a crystal-clear 
message gets through to your audience. 

There are four main types of noise that can interfere 
with your signal: credibility, semantic, experiential, and 
bias noise. The graphic to the right shows where the 
various types of noise occur in communication. Your 
job is to minimize the noise as much as possible at each 
step of the process. 

This section will address some of the factors that create 
noise. Noise can be reduced or eliminated through careful 
planning and rehearsing. 

[Figure: The Role of Noise in Communication] 

You cannot control 
audience interpretation, 
but by knowing them, 
you can circumvent 
some bias noise. 



There’s Always Room to Improve 

Give a Positive First Impression 


The old adage is true: You never get a second chance 
to make a first impression. But is telling a joke or using 
a cheesy icebreaker really the way to start? 

Make some creative choices before the presentation 
begins. What is the first thing you want your audience 
to experience? What's the first impression you want 
them to have? In what kind of mood should your introduction 
put them? These choices aren’t driven only by 
what you say. Moods can be set by the type of room, 
lighting, music playing, items on the chairs, image 
projected on the screen, clothing you’re wearing, 
or entrance you make. 

No matter how much you want the audience to like 
you for your mind and not for how you look, their first 
impression—for a while, at least—is based on what they 
see. Within the first few seconds, people will classify 
you somewhere in their minds and judge whether they 
will be able to connect with you or not. 

Aristotle argued against letting first impressions influence 
the perceived validity of the content. He said, 
“Trust... should be created by the speech itself, and not 
left to depend upon an antecedent impression that the 
speaker is this or that kind of man.” In ancient Greece, 
however, oration was very sophisticated and followed 
many rules. Most people in today’s audiences are a bit 
more shallow and will use the first few crucial seconds 
to judge you. 

Fear of being judged makes so many people afraid of 
public speaking. But it’s in your power to shape that 
first impression. Don’t allow yourself to be intimidated. 
You’d be amazed to hear what’s on the audience’s 
minds before you begin your presentation. With the 
advent of social media, you can see—and take comfort 
in—just how shallow and mindless the stream is. These 
are actual comments made by an audience while waiting 
for a presentation to begin: “Scalding hot coffee in 
a room packed with socially inept people means I now 
have a burned hand.” “I hope there’s no line in the ladies 
room.” “I hope she has gone to Toastmasters since the 
last time she spoke.” “Aw, man, I missed mimosas at registration 
this morning.” 

Yep, that’s the stuff on their minds before you present. 
Their expectations are pretty low and self-focused, so 
creating a memorable first impression with these folks 
shouldn’t be too tough. 

First impressions do not have to be overly dramatic 
or gimmicky. They’re about revealing your character, 
motivations, abilities, and vulnerabilities. You’re asking 
the audience to walk in your shoes, and they don’t even 
know if they like you yet, let alone your taste in shoes. 
So establishing who you are and your own likability 
is paramount. 

Be aware that part of the first impression the audience 
forms about you has already been made before 
you enter the room. Consider all the communications 
you sent before the presentation. What was the 
invitation like? How was the agenda framed? How was 
the e-mail worded? What was your bio like? Since all 
the interactions leading up to the actual presentation 
create the real first impression, make sure you’ve framed 
them appropriately. 

Successful first impressions introduce you and your message 
in a way with which the audience can identify. It’s 
the nature of all audiences to compare themselves to you 
and look for similarities and differences. Make these similarities 
and differences clear as they size you up, so they 
get over that phase quickly. Create a common identity 
between you and them. 

The audience learns a lot about you based solely 
on how you appear for the first time. 

When I first started hosting Slide:ology workshops 
here at Duarte Design, I kept trying to squeeze an 
entire workday in before the workshop started at 
9:00 am. I’d blast into the room, test the projector, 
cross-check files, and jump into the material. 
I was busy, distracted, and wound up pretty tight. 
Any of the poor early birds got a clear “I’m super 
busy and was trying to squeeze in an entire workday 
before you got here” message. I noticed that 
the crowd wasn’t warm or receptive to me. 

Then, I attended a workshop hosted by a friend 
of mine and author of Presentation Zen, Garr 
Reynolds. Before his presentation, he entered the 
room upbeat and engaged, shook hands, asked 
attendees questions, and set a completely differ¬ 
ent tone. They perceived that he had all the time 
in the world for them. Right off the bat, he came 
across as carefree and warm. Even though our 
content was of similar nature, he had them eating 
out of his hands before he said the first word, 
and I didn't. 


This is my “do you really think 
I have time for this” look. 



This is my “I have all the time 
in the world for you” look. 

Hop Down from Your Tower 


Have you ever sat through a presentation where—even 
though the presenter sounded super smart—you had no 
idea what he was really saying? 

Most highly specialized fields like those in science and 
engineering have a distinct lexicon that's used every 
day—one that’s familiar to experts but foreign to anyone 
not in that field. Innovation is happening very quickly, 
and each new field generates a plethora of new terms 
weekly. If you’re an expert, you can’t assume that people 
have kept up with your field. Using highly specialized 
jargon when you’re addressing nonspecialists can hamper 
your efforts and reduce the amount of help you receive 
from them—solely because they don’t understand what 
you’re saying. You need to modify your language so it 
resonates with the potential collaborators and funders 
of your idea. 

Before specialists acquired their newfangled vocabulary, 
they used the common language of the masses. But as 
they studied their narrow fields, specialized terms and 
jargon snuck in. It’s like the folks who built the Tower of 
Babel. Originally, they all spoke a unified language. But 
due to their pride, their language was confused and they 
were scattered throughout the earth. 

When presenting to a broad audience, you need to go 
back to a common and unified language so your audi¬ 
ence doesn’t scatter in confusion. Even though it’s fun 
to sound smart—and yes, to confound others with your 
awesome smarty-pantness—this hinders the adoption of 
your idea when you’re speaking to a group that isn’t as 
specialized as you. 

“Speaking in jargon carries penalties in a society that 
values speech free from esoteric, incomprehensible 
bullshit. Speaking over people’s heads may cost you 
a job or prevent you from advancing as far as your 
capabilities might take you otherwise.” 


Carmine Gallo 2 

If your idea requires the use of special terminology, you 
must be prepared to translate it into intelligible words 
that laymen can understand. It’s imperative that you 
know how and when to switch between specialized and 
common language. Don’t choose words that fall outside 
your listener’s vocabulary. 3 Tailor your language to what 
the audience uses. 

For example, a great Nobel Laureate in Physiology, 
Barbara McClintock, discovered in the 1940s that genes 
are responsible for turning physical characteristics on or 
off. However, her groundbreaking research was greeted 
with skepticism and wasn’t fully understood until the 
1970s because of her communication style. McClintock 
had a vivid inner vision and a rapid-fire delivery. She 
would often leap back and forth from microscopic 
observation to model to conclusion to result—all in a 
single sentence! Most audiences were ill-prepared, or 
simply too lazy, to work hard enough to master the data 
that poured forth from her. The way she communicated 
caused her findings to lie hidden for years! 4 

Jargon isn’t confined to specialized professions. Many 
good ideas die because they fail to navigate the very 
organization from which they originate. Different 
departments within the same entity often use different 
languages, which can create internal confusion. In some 
meetings, more acronyms are spoken than real words. 

An audience will not adopt your idea unless they under¬ 
stand it. Your idea’s perceived value will be judged 
not so much on the idea itself but on how well you can 
communicate it. 



Value Brevity 


Presentations fail because of too much information, not 
too little. Don’t parade in front of the audience spew¬ 
ing every factoid you know on your topic. Only share 
the right information for that exact moment with that 
specific audience. 

Abraham Lincoln constructed the Gettysburg Address 
with 278 words and delivered it in just over two min¬ 
utes. Though one of the shortest speeches in history, it 
is also considered to be one of the greatest. 

The speech’s purpose was to dedicate the Gettysburg 
cemetery and eulogize the fallen. Though eulogists at 
that time traditionally took hours, Lincoln was so quick 
that the photographers were still setting up their equip¬ 
ment as he finished; hence we have no photos of him 
delivering the speech. 

Most people aren’t even aware that Lincoln wasn’t the 
featured speaker that day. Edward Everett shared the 
platform and delivered a eulogy in the traditional style, 
spending two hours praising the virtues of the soldiers. 
The day after the speech, Lincoln received a note from 
Everett that complimented him for the “eloquent sim¬ 
plicity and appropriateness” of his remarks. Everett said, 
“I should be glad, if I could flatter myself that I came as 
near to the central idea of the occasion, in two hours, 
as you did in two minutes.” 5 

Lincoln had two hours and took two minutes. This forced 
him to make the central ideas clear. Even though it’s 
brief, Lincoln’s address still covers the key components 
of the presentation form. He discusses what is by stating 
historical national values, the current war situation, and 
the purpose of the gathering. He startles the audience 
by claiming that they cannot dedicate or consecrate the 
ground, although that’s what they thought they were 
there to do. Instead, he proposes a call to action: That 
the crowd resolve that the dead shall not have died in 
vain. He then describes the new bliss of a free nation. 

One thing that will help you remain brief is to put your own 
constraint on the amount of time you present. Imposing a 
shorter time frame requires you to be succinct. If they give 
you an hour, target a talk at forty minutes. Restriction of 
time forces clear structure and a filtering-down process 
that leaves only imperative messages. 


“If I am to speak for ten minutes, I need a week for 
preparation; if fifteen minutes, three days; if half an 
hour, two days; if an hour, I am ready now.” 


Woodrow Wilson 




WHAT IS 

Four score and seven years ago our 
fathers brought forth on this continent, 
a new nation, conceived in Liberty, and 
dedicated to the proposition that all men 
are created equal. 

Now we are engaged in a great civil war, testing 
whether that nation, or any nation so conceived and 
so dedicated, can long endure. We are met on a 
great battlefield of that war. We have come to 
dedicate a portion of that field, as a final 
resting place for those who here gave their 
lives that that nation might live. It is 
altogether fitting and proper that 
we should do this. 



WHAT COULD BE 

But, in a larger sense, we can not 
dedicate—we can not consecrate—we 
can not hallow—this ground. The brave 
men, living and dead, who struggled here, 
have consecrated it, far above our poor 
power to add or detract. The world will little 
note, nor long remember what we say here, 
but it can never forget what they did here. 
It is for us the living, rather, to be dedicated 
here to the unfinished work 
which they who fought here have 
thus far so nobly advanced. 



NEW BLISS 

It is rather for us to be here dedicated 
to the great task remaining before 
us—that from these honored dead we take 
increased devotion to that cause for which 
they gave the last full measure of devotion— 
that we here highly resolve that these 
dead shall not have died in vain—that this 
nation, under God, shall have a new birth 
of freedom—and that government of the 
people, by the people, for the people, 
shall not perish from the earth. 







Wean Yourself from the Slides 


Any slides you use during your presentation should 
serve as a stage setting or backdrop; they should rarely 
be the sole focus for the message. You, not the slides, 
deliver the message. People can only process one 
inbound message at a time. They will either listen to 
you or read your slides; they cannot do both. 6 

When you open a slide application to create a new 
slide, the default format you’re offered is appropriate 
for a report. If you fill the default master template with 
words, it will take your audience twenty-five seconds 
to read the slide. Since they can't read and listen at the 
same time, if you have forty slides multiplied by twenty- 
five seconds, they’ll be reading for over sixteen minutes 
of your presentation instead of listening to you. 
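
The arithmetic above is easy to rerun for your own deck. The short sketch below is only an illustration: the function name is hypothetical, and the twenty-five-second default is simply the figure quoted in this section, not a measured reading rate.

```python
def reading_minutes(slide_count, seconds_per_slide=25):
    """Return the minutes an audience spends reading instead of listening.

    Assumes every slide is dense enough to need the full reading time,
    which is a worst-case simplification for illustration only.
    """
    return slide_count * seconds_per_slide / 60


# Forty text-heavy slides at twenty-five seconds each
minutes = reading_minutes(40)
print(f"{minutes:.1f} minutes spent reading")  # prints "16.7 minutes spent reading"
```

Plugging in your own slide count gives a rough sense of how much of your speaking slot is lost to reading when the slides are dense.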

By planning the structure first, you can ensure the pre¬ 
sentation won’t go too long. Audiences squirm when a 
frustrated presenter delivers for fifty-five minutes and 
then says, “Oh, man, where’d the time go? I still have 
forty-three slides, so hang on. I’ll get through them in 
the next five minutes.” If you plan a solid structure with 
the time frame in mind, it guarantees you will stay within 
the time constraints. 

What’s the right number of slides? There is no defini¬ 
tive “right” number of slides for a presentation. It’s all 
driven by the personal delivery and pacing of the pre¬ 
senter. So the answer is “as many as necessary to get 
your point across.” Hollywood scene and story analysts 
adhere to the practice of making scenes no longer than 
three minutes for fear of losing the audience’s interest. 7 


Three minutes! Odds are high that your audience is los¬ 
ing interest every three minutes too, and to compound 
the problem, you don’t have a $100 million blockbuster 
movie budget. Because the presentation medium is more 
static than cinema, don’t stay on a slide for any more than 
two minutes. Changing the visuals as often as possible 
helps retain audience attention. 

Most presentations put multiple points on each slide, which 
makes the result a document, not a slide. If you choose to put only one 
idea per slide, you’ll have more slides than are traditionally 
seen in a slide deck and that’s okay. 

I was invited to speak for forty-five minutes at a lunch¬ 
eon keynote. The organizers asked for the slides to be 
submitted thirty days in advance, so I crafted the mes¬ 
sage, rehearsed it, and sent a deck with 128 slides. 

A week before my talk, I got a call from the organizers 
telling me that the keynote was reduced to twenty min¬ 
utes and to resubmit slides. So I trimmed, rehearsed, and 
timed it for twenty minutes. The day of the presentation, 
the emcee reminded me to “stay within the forty-five 
minute slot because people enjoy a Q&A.” Shocked, I 
told him that they’d reduced the speaking slot to twenty 
minutes. “No, you have an hour. We just told you twenty 
minutes because you had too many slides and we thought 
you’d go long.” Internally, I was shouting, “I CREATE 
PRESENTATIONS FOR A LIVING,” but on the outside, 
I smiled and said, “Well, there’ll be a forty-minute Q&A. 
I hope they have a lot of questions.” 

Slide Content Reduction 

There is a range of slide-content density. The number 
of words and amount of time it takes the audience to 
process the information determine whether you've created 
a dense document or a true visual aid to project 
onto the screen. 

Your goal is to move away from projecting a document 
and toward giving a presentation. Only put elements on 
your slides that help the audience recall your message. 
Reduce large phrases and bodies of copy to single words. 
Simplify the slides so the audience can process each one 
in under three seconds. Remove as much from the slides 
as possible and move material into the notes. You can 
actually put as much information in the notes as you'd like. 

Then, set up the slideshow to project the notes on the 
computer in front of you (Set Up Show/Show Presenter 
View). You can use the machine facing you as your tele¬ 
prompter with all your notes in it, but behind you are 
projected clear, comprehensible slides for the audience. 
That way you won’t miss a beat! 



After hearing the advice to remove as much as possible 
per slide, many react with, “But my boss wants each 
of her direct reports to send in a five-slide overview of 
our initiatives, and if I make sparse slides, she might not 
understand the progress we’ve made.” The boss is not 
asking for a presentation. She’s asking for a document. 
So cram as much as you need into that document to 
make it clear. There’s a time to be sparse per slide when 
you’re presenting and a time to be comprehensive per 
slide when submitting a document. 

When slides are used appropriately, they work with the 
presenter seamlessly like a dance partner on the stage. 
One is coming and the other going, and each contrib¬ 
utes to the other’s stage presence and craft. Practice 
with your slides until you move as one with them. 



VISUAL AID 

Only project material 
on the screen that helps 
the audience remember 
your message. 

Balance Emotion 


Persuasive presentations should appropriately balance 
analytical and emotional appeal. 

Many pages in this book have been devoted to creat¬ 
ing emotional appeal—not because it’s more important 
but because it’s underused or nonexistent and should 
be incorporated. Now that your presentation has plenty 
of emotional appeal, let’s stay cognizant of its appropri¬ 
ate use. 

Some topics are inherently emotionally charged—like 
gun control, racism, or abortion—and therefore natu¬ 
rally lend themselves to more emotional arguments. 

On the other hand, topics like science, engineering, 
finance, and academics inherently invite analytical ap¬ 
peals. But just because a presentation is more heavily 
weighted toward analytical content doesn’t mean it 
should be void of all emotion. 

A question that comes up often is “How much emotion 
should I use when presenting to a group of economists?” 
(You can replace the word economist with others like 
analysts, scientists, engineers, or researchers.) Some 
people choose their careers because of their analytical 
nature. If you know the audience has a career in a stereotypically 
analytical space, only a tiny percentage of your 
appeal should be emotional—but do not leave emotion 
out entirely! At the very least, open and close your pre¬ 
sentation with why. Many times the reason why people 
are involved in economics or science or engineering or 
research has an emotional component. Don’t strip it out, 
but don’t overdose on it either. 

There’s another Greek word in the mix, in addition to 
ethos, pathos, and logos (see page 100). That word is 
kairos. It means “timing” or “timeliness”—“speaking in 
the right moment, in the right way.” In order to do this, 
you must understand the situation, cross-check, and, if 
necessary, modify your presentation by adjusting its 
emotional and analytical balance so that it’s appropriate. 

Keep in mind that all things in life should be done in moderation, 
including emotional appeal. Emotion should not 
be overamplified. If it is, the audience will feel manipu¬ 
lated. Appealing to emotion is only effective if it furthers 
the argument. Creating the right balance is alluring, 
whereas imbalance hurts your credibility. 


Modify the presentation to map to the needs of the audience. Increase or 
decrease the emotional and analytical to match the situation. 8 



                      BROAD AUDIENCE          ANALYTICAL AUDIENCE 

Mode of response      Visceral                Cerebral 
Structure             Weighted toward story   Weighted toward report 
Responds to emotion   Receptive               Suspicious 
Effective organs      Heart, gut, groin       Head 

Instead of viewing the rhetorical triangle as something that is static and 
must be evenly filled in to achieve perfect balance on all sides, consider it 
dynamic and alter the emotional appeal appropriately based on the situ¬ 
ation. If you're speaking to a broad audience on an emotionally charged 
topic, then don’t go through pages and pages of analytical research. Pull 
back on the brainy material. When speaking to a specialized audience in a 
narrowly analytical field, you need to emphasize analytical content. Notice 
in the far-right triangle how imbalance will diminish your credibility. 



Highly analytical audiences do not 
like having their heartstrings tugged 
too much, if at all. They tend to inter¬ 
pret it as manipulation and an unnec¬ 
essary waste of time. But these folks 
are human, and all humans care, like 
to laugh, and can be touched deeply. 
So, for example, including material in 
a presentation that shows how lives 
will be changed does motivate them, 
unless it’s presented in a melodra¬ 
matic way. 


Emotionally driven audiences don’t 
enjoy overuse of facts and details. 
They want to know that the details 
have been carefully considered, 
but they probably don’t want to 
see twenty slides about them. They 
need a few proof points. A sales 
force might get more fired up about 
the incentive plan than diagrams 
explaining the complex innards of 
how a product ticks. 


Notice with the two triangles on the left, the credibility of the speaker stayed intact. 
That’s because these presentations hit the right balance for the audience. 


CREDIBILITY 



EMOTIONAL ANALYTICAL 

Taking the emotional or analytical 
appeal too far in either direction 
hurts your credibility. Even if you 
are the most qualified presenter 
in the world, being too geeky or 
too emotional can create a chasm 
between you and the audience. 

Host a Screening with Honest Critics 


We’ve become a first-draft culture. Write an e-mail. Send. 
Write a blog entry. Post. Write a presentation. Present. 
The art of crafting and then recrafting something well is 
disappearing in communications. 

“The first draft of anything is shit.” 


Ernest Hemingway 

It’s easy for us to get attached to our own ideas, so it’s 
good to have another set of eyes and ears to review them. 
The best way to get feedback is to host a screening to 
test your messages before you present. The screening 
should filter out any meandering structure, obstructed 
messages, and confusing language. 

Keep an open mind and come to the screening knowing 
that you will probably need to rework some percentage 
of the communication that you labored over for so long. 
No one ever hears during the first review, “I wouldn’t 
change a thing.” Regardless of how relentlessly you 
worked, there will be changes at this phase. The informa¬ 
tion was created from your perspective. To the degree 
you are receptive to feedback from others, you’ll be able 
to refine the receptivity potential of your material for others. 
The screening should influence and round out how 
you deliver your presentation. 

Conway’s Law states, “Any organization that designs a 
system will inevitably produce a design whose structure 
is a copy of the organization’s communication struc¬ 
ture.” In other words, the quality of communication your 
organization generates is limited to the quality of the 
communication of the organization itself. For this reason, 
a presentation’s quality will not exceed the quality of 
the planning process that precedes it. Therefore, pulling 
together a team who will give you honest, helpful 
feedback for the screening may require you to go outside 
of your organization. 

Remove yourself from a sugar-sweet or dysfunctional 
review environment. Instead, pull together a small group 
that has a similar profile as your target audience. They 
could be people in your industry like analysts, internal 
employees, trusted customers, or a focus group. Choose 
naysayers who will scrutinize, criticize, and challenge your 
perspective. You want them to be brutally honest when 
they tell you what they think. 

Each screener should have a printout of the slides and 
notes of your presentation so they can quickly jot down 
thoughts on the words and the visuals. Run through the 
entire presentation once and then revisit each section 
carefully. A solid review meeting should last about three 
times as long as the presentation itself. If your presenta¬ 
tion is twenty minutes long, for example, each screening 
should take an hour. If your presentation lasts an hour, 
the screening should take three. 



Find someone 
as far away from 
your twisted 
environment as 
possible to give 
honest feedback. 

Give your test audience a safe environment to tell you what they really 
think. Solicit feedback in a nondefensive way and let them challenge 
all assumptions. Encourage them to tell you whether your presentation 
genuinely kept their interest. 

Don't ignore your screeners’ insights or give “yeah, but...” or “if they really 
knew...” excuses. Really listen to them and incorporate their insights. Then, 
rework your material. Screening the presentation will remove any burrs that 
would unintentionally snag or poke the audience with misunderstandings. 

These negative organizational systems, illustrated below, limit the qual¬ 
ity of your communication. If you work in any of these communication 
environments, go outside your organization to get honest, constructive 
feedback on your presentation. 



CONCEITED CAPTAIN: Leader 
engages late and forces team into 
time-crunched, low-quality output. 



POLITICAL PARANOIA: No one makes 
progressive decisions out of fear for 
their own destruction. 




VACUUM VISIONARY: There’s no room 
for alternative perspectives, and subject 
matter experts have no seat at the table. 


LACKEY LEADER: Indecisive leadership 
and flattery-driven consensus stalls 
strategy traction. 



MESSAGE MAGIC: In the absence of a 
strategy, imaginary messages become 
the norm. (See page 199) 



CUSTOMER COLD-SHOULDER: Self- 
focused communication is valued more 
highly than customer insight. 

Case Study: Markus Covert, PhD 

Pioneer Award Winner 


Thousands of the nation’s top scientists apply for the 
Pioneer Award, sponsored by the National Institutes of 
Health. The winners usually have high-risk, high-reward 
ideas that will transform how medical research is per¬ 
formed. Finalists travel to Maryland to deliver a fifteen- 
minute presentation followed by fifteen minutes of Q&A. 
Presentations are delivered to a panel of top-notch sci¬ 
entists who may or may not be in the same field of study, 
which means the ideas have to be conveyed in a way that 
resonates with a scientist in any field. 

Markus Covert, PhD, assistant professor of bioengi¬ 
neering at Stanford University, was the 2009 winner 
of a grant for $2.5 million. He strongly believes that 
the amount of feedback and practice he put into his 
presentation is what sealed the deal for him. 

Everything in Covert's presentation had to justify his 
hypothesis and warrant its funding. He needed to 
include the big picture but also dive deep in places to 
prove that he was knowledgeable. Covert challenged 
a long-standing scientific-communication tradition 
by incorporating emotional appeal in his presentation. 
He wanted the tone to be inspirational in addition to 
instructive—a brave departure from cerebral scientific 
traditions. Adding such a visceral layer was counterintuitive; 
he knew even a thin veneer of emotional 
appeal would go a long way. So instead of focusing 
solely on how, he included content about why his project 
would change scientific research. 

Knowing his approach was risky, Covert rehearsed the 
presentation twenty different times with scientists in 
various disciplines. Twenty. Following his very scientific 
nature, he systematically presented over and over to sci¬ 
entists from various backgrounds, collected feedback, 
and modified the presentation to reflect their insights. 
At times, it was a struggle to add material while remov¬ 
ing pieces of which he’d become fond, but Covert took 
a humble stance and was determined to embrace and 
implement the feedback. 

It wasn’t until he reached the nineteenth and twentieth 
rehearsal that the feedback was “Don’t change a thing; 
it’s perfect!” It was a lot of rehearsing, but he knew the 
material and delivery had to be just right. 

There is much about science that is inspiring. Often, the 
heartfelt passion gets buried in the facts and proofs. 

By including emotion in his presentation and rehearsing 
it until it was just right, Covert won the grant. He 
now gets to spend time in his lab pursuing his passion 
instead of worrying where the funding will come from. 


Factoid: Covert is using his grant funds to pursue the Holy Grail of biology, which 
some have called the “ultimate test”—building a computer program that simulates 
an entire cell. If successful, his research could revolutionize our understanding and 
treatment of disease. 

[Slide: “Model-driven discovery will REVOLUTIONIZE biological research.”] 

[Slide: chart titled “Effect of Research Breakthroughs on Life Expectancy: 
Additional Years of Life Expectancy (Projected).” Categories: Cancer; Heart 
Disease; Cancer and Heart Disease; Cancer, Stroke, Heart Disease, and 
Diabetes; one further label is illegible in the source. Axis ticks run from 
5 to 30. (Source: Miller, Milbank Quarterly, 2003) Average female in 1984 
(81 years).] 



Covert used a clean minimalist design for his slides. Many scientific 
presentations clutter the slides with too much data. His were 
perfectly balanced. 

Case Study: Leonard Bernstein 

Young People’s Concerts 


Leonard Bernstein was a talented composer, conduc¬ 
tor, pianist, teacher, and Emmy-winning television per¬ 
sonality. He loved to talk about music and did so with 
everyone: friends, colleagues, teachers, students, and 
even children. Bernstein’s unique intelligence and wit 
afforded him a reputation as music’s most articulate 
spokesperson. 9 Variety magazine summed up his appeal 
by stating “The [New York] Philharmonic’s conductor 
has the knack of a teacher and the feel of a poet. The 
marvel of Bernstein is that he knows how to grab atten¬ 
tion and carry it along, measuring just the right amount 
of new information to precede every climax.” 10 

Of all the things Bernstein accomplished, leading the 
Young People’s Concerts was one of his proudest lega¬ 
cies. Several times a year, Carnegie Hall would fill with 
young children who came to learn about classical music. 
Bernstein would deliver a lecture-driven concert that 
could hold the attention of small children for an hour 
or more as he taught them complex music theory. The 
lecture-concerts were successful because Bernstein 
put the same energy and discipline into them that he 
put into his music. 

Bernstein’s explanations, analogies, and metaphors 
were delivered in a clear, simple, yet poetic presentation 
that consistently stayed at the children’s understanding 
level. He isolated various layers of the music, explained 
the theory behind it, played excerpts of it on the piano, 
and used various instrumentalists to play portions of it. 
Then, when the full piece was performed, the children 
had a clearer understanding of the many nuances. 

Below are three excerpts from one of the most dif¬ 
ficult musical subjects to explain, “What is Symphonic 
Music?” Bernstein uses items familiar to the children 
as metaphors: 11 

• “How does development actually work? It happens 
in three main stages, like a three-stage rocket going 
into space. The first stage is the simple birth of the 
idea. Like a flower growing out of a seed. You all know 
the seed, for example, that Beethoven planted at the 
beginning of his [fifth] symphony, “dunt dunt dunt 
duuuunt.” Out of it rises a flower that goes like this: 
<plays piano>” 

• “[Brahms] puts two to three melodies together...and 
takes scraps of melodies and turns things upside 
down like pancakes. But it’s not that it’s upside down 
but that it sounds amazing upside down. Will it be 
beautiful? That’s what makes Brahms so great. Music 
doesn’t just change. It changes beautifully.” 

• “I’m hoping you’ll hear it with new ears and hear the 
symphonic wonders of it, the growth of it, and the 
miracle of life in it that runs like blood through its 
veins and connects every note to every other note 
and makes it the great piece of music that it is.” 

Bernstein worked for days on his Young People’s Concert 
scripts and rehearsed them several times so that when 
he was talking it would sound as if he were just having a 
calm, casual conversation with the children. 

Bernstein put just as much rehearsal energy 
into his presentation scripts as he did into his 
musical scores. 

Few of Bernstein’s viewers were aware of how much dogged work 
went into his presentations. He was so adept at displaying an easy, 
casual manner that his presentations appeared to be born effort¬ 
lessly and spontaneously. The truth, of course, was that he worked 
hard on his scripts. Weeks before, and right up to the last minute, his 
offices, house, and dressing rooms were filled with scattered piles of 
paper as he and his team wrote, planned, and rehearsed. 12 

Bernstein generated ideas on yellow legal pads and collaborated with 
his equally dedicated co-workers until a graceful, accessible script 
was formed. 13 The team would make sure each metaphor and alle¬ 
gory was appropriate for the audience. Bernstein himself would walk 
through the script several times, marking and rehearsing as he went. 

Bernstein and his team edited constantly, right up to the moment 
he walked on stage. After each show, they would watch the record¬ 
ing of what he said and evaluate it to improve it the next time. He'd 
identify improvements he could make so he didn’t commit the same 
mistakes over and over. While all good conductors review their con¬ 
certs, Bernstein applied this practice to his presentations as well so 
that each one got better than the last. 

Conductors are trained to have a disciplined rehearsal process, so 
editing a script through multiple iterations wouldn't be a foreign 
process for them. They read a musical score the way most people 
read a book. Paging through Bernstein’s scores is like watching him 
rehearse. He studied and reviewed a score several times, working 
hard to represent the composer’s intent. He had a special pencil that 
he used while reading scores that he called his “red-ee blue-ee” (one 
end wrote in red pencil, the other in blue). As he continued through 
the score, he flipped the pencil back and forth as he thought about 
the expression of the music from his point of view as the conductor 
or that of the individual musicians (his audience). 14 

The blue markings were conductor markings for Bernstein himself 
that helped identify phrasing, instrumental cues, and musical 
emphasis. The red markings were notes to the musicians that would 
be transferred to each of their specific parts. These markings are 
particularly interesting. He was a literary conductor that didn’t just 
draw attention to a marking; he poetically described what he wanted 
the musician to feel. John Cerminaro, who played for years in the New 
York Philharmonic’s horn section, said, “You couldn’t just play a solo
according to the notes on the page; [Bernstein] wanted something 
special on an emotional level every time.” 15 

Bernstein tried to anticipate everything while he rehearsed and 
refined his presentation scripts. He planned every word and audience
reaction carefully. He developed his scripts to the point of
anticipating multiple audience responses—even writing alternate 
sections based on how people might react to the previous point. He 
even made notations of where and how he would stand while on the 
stage. The New York Philharmonic archives contain copies of scripts 
that show as many as ten revisions (in addition to the rounds on his 
yellow pads), which is a reflection of the thoroughness of Bernstein’s 
thought process and rehearsals. 16 

Bernstein wrote about his Young People’s Concerts experience in 
1968 using words that can stand as his credo. “These concerts are 
not just concerts—not even in terms of the millions who view them 
[on TV] at home,” he wrote. “They are, in some way, the quintessence
of all I try to do as a conductor, as a performing musician. There is a
lurking didactic streak in me that turns every program I make into a 
discourse, whether I utter a word or not; my performing impulse has 
always been to share my feelings, or knowledge, or speculations about 
music—to provide thought, suggest historical perspective, encourage 
the intersection of musical lines. And from this point of view, the Young 
People’s Concerts are a dream come true, especially since the sharing 
is done with young people—that is, people who are eager, unprejudiced,
curious, open, and enthusiastic.” 17

Regardless of your subject matter, passion and practice make perfect. 


This excerpt from the script for “Humor in Music” 
shows how carefully Bernstein and his team planned. 

22 

BERNSTEIN (CONT'D) 

In music composers can make these
surprises in lots of different ways - by
making the music loud when you expect it
to be soft, or the other way around; or
by suddenly stopping in the middle of a
phrase; or by writing a wrong note on
purpose, a note you don't expect, that
doesn't belong to the music. Let's try
one, just for fun. You all know those
silly notes that go -
SING: SHAVE AND A HAIRCUT - 2 BITS

O.K. Now you sing "Shave and a Haircut",
and the orchestra will answer you with
"2 bits" and see what happens.

ORCH: ILLUSTRATE

(IF NO LAUGH) 

Now, you see, you didn't laugh out loud. 

(IF LAUGH) 

Now most people don't laugh out loud about 
musical jokes. That's one of the things 
about musical humor: you laugh inside. 
Otherwise you could never listen to a 
Haydn symphony: the laughter would drown 
out the music. But that doesn't mean a 
Haydn symphony isn't funny. 

(MORE)
Practice makes perfect—kind of. An old adage says, “if anyone does not
stumble in word, he is a perfect man.” And no one is perfect. There is
always room to improve. So be tenacious in preparing yourself ahead of 
time. Rehearse and re-rehearse. Then afterward, solicit feedback—and if 
it was taped, review the recording and then start the refinement process 
all over again. 

Successful people plan and prepare. To be successful in any profession 
requires discipline and mastery of skills. Applying that same discipline 
to the skill of communication will attach the audience to your idea and 
improve your professional trajectory. 
Changing the World Is Hard


If you say, “I have an idea for something,” what you 
really mean is, “I want to change the world in some
way.” What is “the world” anyway? It is simply all of 
the ideas of all of our ancestors. Look around you. Your 
clothes, language, furniture, house, city, and nation 
all began as a vision in someone else’s mind. Your food, 
drink, vehicles, books, schools, entertainment, tools, 
and appliances all came from someone’s dissatisfaction
with the world as they found it. 1 Humans love
to create. And creating starts with an idea that can 
change the world. 

Staying passionate and tenacious about your idea 
requires that some part of you be uncomfortable with 
the status quo. At times, you must have enough resolve 
to put your reputation on the line for the sake of advancing
your idea. It's scary to go out on a limb and approach
others with a product, philosophy, or ideal that you passionately
support. Some will challenge it, and some will
reject it. And that’s hard. Society doesn’t reward rejects, 
but it does reward those who have the tenacity to keep 
going after being rejected. So don’t give up. 

My husband and I collect large, oversized vintage posters.
Once, while on vacation with our kids, we stopped
in to see one of the poster dealers. He wore white cotton
gloves as he carefully turned over each table-sized
poster. When he turned to the poster on the right, both 
my kids gasped and said, “Oh my gosh, Mom. It’s you! 
You have to buy that poster.” Hmmm. Was it a good 
thing they felt that way? 


The poster is an old French advertisement for baking 
spices. Baking spices! It’s comical to see how fired up 
this gal is to promote her little collection of spices. But 
if I were to replace her pack of spices with the words 
“effective presentations,” I guess this is me. I get pretty 
fired up. 

Ideas are not really alive if they are confined to only 
one person’s mind. Your idea becomes alive when it is 
adopted by another person, then another, and another, 
until it reaches a tipping point and eventually obtains a 
groundswell of support. 

President Kennedy gave a speech declaring that by the 
end of the decade, the United States should land a man 
on the moon and bring him home safely. He wanted 
support from every American. He said in the speech,
“In a very real sense, it will not be one man going to the
moon—it will be an entire nation. For all of us must work
to put him there.” He wanted the entire country to feel
responsible for supporting his vision. Later in the 1960s,
JFK was touring NASA headquarters and stopped to
talk to a man with a mop. The president asked him,
“What do you do?” The janitor replied, “I’m putting the
first man on the moon, sir.” This janitor could have said,
“I clean floors and empty trash.” Instead, he saw his role
as part of the bigger mission that was to fulfill the vision 
of the president. As far as he was concerned, he was 
making history. 2 

“The only reason to give a speech is to change the world.” 

John F. Kennedy 
“There is one thing stronger
than all the armies in the 
world, and that is an idea 
whose time has come.” 

Victor Hugo 3 
Use Presentations to Help Change the World


Presentations really can change the world. Who would 
have thought that a movie about a presentation would 
win an Academy Award, create global awareness, and 
incite change? Long before An Inconvenient Truth was 
on anyone’s radar, former Vice President Al Gore had 
delivered his presentation hundreds of times to influential
audiences around the world. In fact, he’d been delivering
a similar presentation as far back as the 1970s.

You might not need to change the entire world, but 
you can definitely change your world using a presentation.
Many of the people featured in this book delivered
presentations over and over. They didn’t just present 
once and call it a day. Their lives were spent constantly 
communicating their visions. 


To see a systemic adoption of your idea, you may have to 
deliver multiple presentations. On your way to change the 
world, there will be key communication milestones that 
become catalysts for your success. Each milestone is an 
opportunity to adjust the strategy, collaborate, and realign 
the team. The brilliant discussions that occur when pulling 
a presentation together sometimes have as much value as 
the presentation itself. 

Below are only a few of the milestones in a product launch 
that include a presentation. Each one represents a critical 
communication stage in the product’s life cycle that is 
usually conveyed through a presentation. 


Figure: Presentations play a valuable role throughout the life cycle of a
product. Milestones shown: create a unique idea; socialize your idea;
research and validate your idea; develop execution plan; request seed
funding; launch product; pitch product; update board members; host
earnings calls; brief analysts; go public; deliver profound keynote
addresses.

“If a business is really a decision factory, then the 
presentations that inform those decisions determine 
their quality.” 
Marty Neumeier 4


ORIGINAL IDEAS 


PRESENTATION 


An understanding of the strategic value of a presentation is 
important to your career. Make sure your world-changing 
ideas are in your organization’s presentations. If not, you'll 
inherit someone else’s thinking and implement their ideas 
instead of influencing innovation with yours. 



ACTIVITIES 

After the ideas are presented and 
agreed to, work activities are generated 
from the presentations. Most presentations
persuade people to take action, so
presentations spawn a lot of activity. 



MEDIA 

Also, after the brilliant thinking in the 
presentation is solidified, it ripples 
through and informs other related 
materials needed to support and 
spread the idea, such as websites, social
media, brochures, and so forth.


Remember, just because you communicated
your idea once doesn’t
mean you’re done. It takes several
presentations delivered over and
over to make an idea become reality.
Well-prepared presentations will
speed up the adoption and change 
your world! 
In mid-August 2001, Ken
Lay presented this exact
slide at an employee meeting
to assure them that
2001 was fine and 2002 
would be even better. Enron 
was valueless by the end of 
2001. (Slide courtesy of the 
Department of Justice.) 






Don’t Use Presentations for Evil 


Anyone who takes even a quick look at the hundreds of 
slides submitted as evidence in the Enron trial will see 
that presentations can play a conspicuous role in the 
perpetration of lies. Presentations are a powerfully persuasive
tool that should be used for good and not evil.

Jeff Skilling (chief executive officer), Kenneth Lay (chairman
of the board), and Richard Causey (chief accounting
officer) all had several counts against them based
on their presentations. All three were charged with ten
counts each for their earnings-call presentations, and
Ken Lay was charged with two counts for employee presentations.
Because their presentations were transmitted
into different states via phones and web technology,
these executives were also charged with the federal 
crime of wire fraud. In fact, Skilling was sentenced to 
fifty-two months for each of the five counts involving 
presentations for which he was convicted. 5 

Presentations got these executives into this mess, while 
the right presentation could have prevented it altogether. 

• Scandal started with a presentation: Andrew Fastow,
Enron’s chief financial officer, was the mastermind of
clever accounting who used “special purpose entities”
to hide billions of dollars and ultimately line his pockets
with over $45 million. According to USA Today, Fastow
gave “a slick presentation on the LJM partnerships”
(the organization created to hide debt), and the “Enron
managers and analysts stared at each other in confusion.
It sounded too good to be true.” 6 A slick presentation
by a slick villain lured them into this mess.

• Scandal could have been prevented with a presentation:
A detailed presentation given by Arthur Andersen’s
David Duncan in February 1999 feebly warned the Enron
Board of Directors’ audit committee of the company’s
risky accounting practices. This presentation could have
saved Enron. If Duncan had boldly built a slide in capital
letters saying “ENRON HAS RISKY PRACTICES THAT
NEED INVESTIGATION,” its demise might have been
avoided. Instead, Duncan’s notes found on the margin
of his dense slide presentation said, “Obviously, we are
on board with all of these [risks].” 7

Enron’s top executives played by their own rules. They 
made risky bets motivated by greed and ambition. The 
collapse was inevitable. As masters of the PowerPoint 
chart, they showed upward projections for sales and 
profits, encouraging employees to invest while they 
themselves were frantically removing their own money. 
Employees who raised questions were mysteriously moved 
to other departments. Skilling distracted investors by proposing
bold strategies for the next big score, like entering
the broadband and weather futures markets. (What’s an oil 
company doing brokering weather, anyway?) 

They aggressively designed communication that 
abandoned reason and truth altogether, and they used 
presentations as a propaganda device to spread lies to 
employees, analysts, and stockholders about Enron’s 
performance. In the ensuing collapse, the credibility of the 
board and the executives involved was obliterated, and 
tens of thousands of employees were financially ruined. 

Oral communications have built and toppled kingdoms. 
Presentations are a powerfully persuasive medium that 
should be used to build up—not tear down. 




Enron’s Presentations During Implosion 

Out of the thousands of presentations given at Enron annually, many had direct implications in 
Enron's demise. This chart highlights several that played a role in the scandal. 
• Enron lawyer sends memo questioning LJM partnership.

• Skilling hit in face with pie at energy crisis presentation in California.

• At analyst road show presentation, Lay/Skilling state Enron has “never
been stronger.”

• Skilling's employee presentation assures them the bottom line is great.
A few hours later, layoffs are announced.

• Whistle-blower Sherron Watkins asks Lay to address issues in an employee
presentation. He doesn't. 10

• Lay leads Internet presentation with employees, tells them he’s buying stock,
and encourages them to do the same.

• Lay presents at energy summit to policy makers asking for more deregulation
so Enron and the country will flourish. 11

• External board is told of Raptor losses but not whistle-blowing. Directors
leave presentation thinking Enron is fine. 12

• AA tells colleagues via video presentation to destroy unneeded records.

• Lay repeats rosy prognosis in conference call with securities analyst after
disclosing $1 billion+ losses.

• Lay uses PowerPoint to show ever-rising revenue to analyst and fund managers.

• As Lay presents at a managers’ meeting that Enron’s “liquidity is strong,”
BlackBerries notify audience of SEC investigation.

• Lay conducts live webcast to analysts, saying, “We’re not hiding anything.”
Enron publicly announces they overstated profits for five years.


“A fortune made by a lying tongue is a fleeting vapor and a deadly snare.” 

Proverbs 21:6 
Gain Competitive Advantage


In life, someone’s always winning, and someone’s always 
losing. This isn’t just true in commerce. Even beliefs and 
values go through seasons of victory and defeat. There’s 
a constant push and pull related to what’s perceived as 
right or wrong, based solely on how it’s communicated. 

Most communicators are visionaries who can see where 
to go and how to get there. An executive “sees” where 
the company needs to go; a manager “sees” how to 
build a strategy; an engineer “sees” how to construct a 
product; and a marketer “sees” how to promote it. Even 
our social causes are “seen” before they are solved. Your 
job as a communicator is to get others to “see” what 
you are saying so your ideas gain traction. If you can 
do that, you win. 


I recently had dinner with a friend who works for a top 
international consulting firm. His group was competing
for a multimillion-dollar piece of business against
another leading firm. They pulled together their smartest
team and delivered a brilliant presentation, and they
were shocked to get the news that they didn’t win the 
business. The reason? Even though the client confirmed 
that my friend’s firm was smarter, the client couldn’t 
understand the findings. Their brilliance was obscured 
by dense, smart-sounding slides. My friend’s firm did 
smarter work; however, the other firm communicated 
its findings in a way that was useful. All the smarts in 
the world are useless if they are not understood. 
Getting stakeholders to understand your ideas—while the
competitor’s concepts remain obscure—ensures a victory. 

If presented well, a smart idea acts as the igniting spark 
for an explosion of human and material resources. A great 
presentation gives smart ideas an advantage. 

If your presentation is great, it can become a broadly 
reaching social media phenomenon. Now, more than any 
other time in history, great presentations transcend the 
moment in which they're given, because they can be seen 
by millions of people who weren't there in person. Your 
presentation can be viewed over and over again, long 
after you gave it. Randy Pausch’s last lecture has been
seen on YouTube over 12 million times. The website
TED.com hosts eighteen-minute-long presentations and
has had over 100 million views. Martin Luther King Jr.’s
“I Have a Dream” speech has been watched over 15 million
times on YouTube. Those numbers are big enough
to start a movement. When a presentation is great and 
is recorded, people will watch it again and again. 

If your message is clear and worth repeating, it will be 
repeated. If your message is repeated, you win! Sound 
simple? It is. 
Case Study: Martin Luther King Jr.

His Dream Became Reality 


Martin Luther King Jr. was one of the greatest orators 
and civil rights activists in U.S. history. His goal was to 
end racial segregation and discrimination using peaceful
means.

King delivered his electrifying “I Have a Dream” speech 
from the steps of the Lincoln Memorial during the 1963 
March on Washington, which became the flash point for 
a movement. 

Insights from “I Have a Dream”: 

The sparkline on the next few spreads includes a full transcript
of the speech to help identify the following insights:

• Contour: King’s speech moves between what is and 
what could be rapidly, which is an appropriate pace 
for the heightened energy of the gathering. 

• Dramatic pauses: We put a line break in the sparkline 
each time he pauses. As you’re reading it, breathe for 
a second or two at the end of each line to get a sense 
of how it was spoken. 

• Repetition: King uses the rhetorical device of repetition
often. Throughout the speech, he repeats word
sequences to create emphasis. Toward the end, he 
repeats the phrase “I Have a Dream” several times, 
like the refrain of a hymn. 

• Metaphor/visual words: King masterfully uses 
descriptive language to create images in the mind. 
For example, he states, “Now is the time to rise from 
the dark and desolate valley of segregation to the 
sunlit path of racial justice.” 


• Familiar songs, Scripture, and literature: King establishes
common ground by referencing many spiritual hymns
and Scriptures familiar to the audience. He even rephrases
a small sequence from Shakespeare: “This sweltering
summer of the Negro’s legitimate discontent will not pass
until there is an invigorating autumn...”

• Political references: King pulls lines from several political
resources like the U.S. Declaration of Independence,
the Emancipation Proclamation, the U.S. Constitution, 
and the Gettysburg Address. 

• Applause: There are varying degrees of applause 
throughout ranging from clapping to clapping with loud 
cheering. In the sixteen-minute speech, the audience 
applauds twenty-seven times. That’s applause approximately
every thirty-five seconds.

• Pacing: King speeds up and slows down to vary the 
quantity of words spoken per minute. This creates three 
distinct bursts or crescendos in his speech that build 
to the passionate ending that describes the new bliss. 

King’s speech heightened the awareness of civil rights 
issues across the country, bringing more pressure on 
Congress to advance civil rights legislation and end racial 
segregation and discrimination. 

In 1963, King was named Time magazine’s Man of the Year. 

A short forty-six years later, the United States elected its 
first African American president, Barack Obama. 

Great communicators create movements. 

Listen along at www as King delivers his speech. 
to dramatize a shameful condition.
I still have a dream.
It is a dream deeply rooted in the American dream. 

I have a dream 


that one day on the red hills of Georgia 

the sons of former slaves and the sons of former slave owners 
will be able to sit down together at the table of brotherhood. 
Case Study: Martha Graham

Showed the World How She Felt 


Although primarily known as a dancer, Martha Graham 
was also a powerful communicator. She developed characteristics
that anyone who aspires to become a great
presenter must cultivate and nourish. She stood out by
moving against the grain of society. She persevered in
spite of seemingly overwhelming obstacles. She fought
against and overcame her fears. She respected and connected
deeply with her audience. And she never held
back from communicating her deepest feelings. 

Graham spent her life challenging what dance is and 
what a dancer can do. She looked upon dance as an 
exploration, a celebration of life, and a religious calling 
that required absolute devotion. 13 

Graham became a dancer against the odds. She grew 
up in an environment where dance was frowned upon 
as a career. When she finally began to study dance with 
the idea of making it her profession, she was considered
too old, too short, too heavy, and too homely to
be taken seriously. “They thought I was good enough
to be a teacher, but not a dancer,” she recalled. But she
knew what she wanted to do and pursued her goal with 
the intensity that marked her entire life. Dance was her 
reason for living. Willing to risk everything, driven by a 
burning passion, she dedicated herself absolutely to her 
art. “I did not choose to be a dancer,” she often said. “I 
was chosen.” 14 

To Graham, traditional European ballet seemed decadent
and undemocratic. Classical ballet dated back
more than three hundred years, when it originated as an
elegant spectacle in the royal courts of Europe. Ballet 
was a highly controlled dance form, characterized by 
grace and precision of movement—but not freedom 
of expression. 

Graham was ready to discard traditional ballet. She 
invented a revolutionary new language of dance, an 
original way of moving with which she revealed the joys, 
passions, and sorrows common to human experience. 
In place of graceful soaring leaps through space, she 
introduced stark, angular movements, blunt gestures, 
and stern facial expressions as she sought to lay bare 
fundamental human moods and feelings. Her dances 
were meant to be challenging and disturbing. 15 

This new kind of dance wasn't to everyone’s liking, as it 
was neither beautiful nor romantic. Graham was often 
the object of ridicule and the butt of hostile jokes. 
Women in America had won the right to vote only a few 
years earlier, in 1920, and many people were still uncomfortable
with the image of the “new woman” who sought
a career and voted. It was acceptable to be a high-kicking,
scantily clad chorus girl, but a woman who ran
a dance company and created works that commented 
on war, poverty, and intolerance seemed unnatural 
and suspicious. 16 

She was protesting. Stark. And American. Some called 
her ugly, others called her revolutionary. But Graham 
was resolute in her desire to communicate how she felt. 
Graham believed that the secret, emotional world
made visible by a dancer’s movement could not always
be expressed in words. She wanted her dances to be
“felt” rather than “understood.” 17 Graham drew inspiration
from the ugly side of life and put it on display. Each
of her dances had a special significance to her, because 
they expressed a fear she had conquered in her own life. 

In 1930, Graham premiered a haunting solo dance of 
mourning called Lamentation. These rare photos
show her sitting on a low bench, wearing a tubelike 
shroud with only her face, hands, and bare feet showing. 
In the dance, she began to rock with anguish from side 
to side, plunging her hands deep into the stretchy fabric, 
writhing and twisting as if trying to break out of her own 
skin. She was a figure of unbearable sorrow and grief. 
She did not dance about grief but sought to be the very 
embodiment of grief. 

Graham recalled, “One of the first times I performed it 
was in Brooklyn. A lady came back to me afterwards 
and looked at me. She was very white-faced and she’d 
obviously been crying. She said ‘you’ll never know 
what you have done for me tonight, thank you’ and left.
I asked about her later and it seemed that she had seen
her nine-year-old son killed in front of her by a truck. 
She had made every effort to cry, but was unable to. 
But when she saw Lamentation she said she felt that 
grief was honorable and universal and that she should 
not be ashamed of crying for her son. I remember that 
story as a deep story in my life that made me realize 
that there is always one person to whom you speak in 
the audience. One.” 18 












Graham moved in a way that gave anger and grief back 
to her audiences. She had a genius for connecting movement
with emotion. She could make visible all those feelings
that people have inside them but can’t put to words.

Communicating in any medium is hard work. Graham's 
dances did not come easily to her. When the idea for a 
new dance was starting to take form, it was “a time of 
great misery.” Graham worked late into the night, propped
up in bed, writing down thoughts, observations, impressions,
quotations from books—anything that could help
feed her imagination. “I would put a typewriter on a little 
table on my bed, bolster myself with pillows, and write 
all night.” 19 

She read widely as she searched for ideas and inspiration,
studying psychology, yoga, poetry, Greek myths,
and the Bible. Gradually, the ideas that filled her notebooks
would begin to reveal a pattern, and she would
write out a detailed script. 20 

In her work, Graham repeatedly portrayed a woman 
called to a high destiny and forced to overcome fear 
before she could answer the call. This was personal, as 
Graham herself believed that she had been given “lonely, 
terrifying gifts”—a sort of divine command to penetrate 
the interior of the human spirit, no matter what comfortless
truths she might find there. 21

In 1955, the U.S. government asked Graham to tour 
major cities in seven countries as a cultural ambassador. 
She gave lectures at each stop but was a very nervous 
presenter. In the biography Martha, author Agnes de
Mille describes the scene. “She hung onto the barre,
clung to the walls. She couldn’t think what to do with 
her hands, with her robes, with her feet.” Finally, she 
escaped into her dressing room and locked the door. 22 
But Graham tried again and again, and she overcame her 
fear. Eventually, the State Department officials named 
Graham “the greatest single ambassador we have ever 
sent to Asia.” 23 

Until she was ninety, Graham continued to deliver 
lectures, which she had developed into an art form. A 
striking figure with a seductive voice, poetic insights, 
and a faultless sense of timing, she learned how to hold 
an audience spellbound. 24 

You could say that by trying to discover herself, she 
founded the world of modern dance. During her long 
journey, she invented a new way of moving, a unique 
dance language that has thrilled audiences all over the 
world and enlarged our understanding of what it means 
to be human. 25 

All of us are unique. We each have our own pattern of 
creativity, and if we do not express it, it is lost for all 
time. Graham defied customs, broke through barriers, 
and presented new ideas. She was loved and reviled, yet 
persistent in overcoming her fears to communicate what 
she felt in her soul. By remaining committed to communicating
how she felt, she changed dance for all time.
Be Transparent So People See Your Idea


You must be willing to be you, to be real, and to humbly 
expose your own heart if you want the people in the 
audience to open theirs. You must be transparent, 
and this is difficult. Standing in front of an audience 
is already a challenge in itself. When stage fright is 
compounded with the new demand on leaders to be 
transparent, it's downright terrifying. 

Being transparent moves your natural tendency of personal
promotion out of the way so there’s more room
for your idea to be noticed. The audience can see past 
you and see the idea. 

There are three keys to being transparent: 

• Be honest: Be honest with the audience and give 
them the authentic you. You’re not perfect; they 
understand that. If you are honest with yourself 
and with them, your presentations will have more 
moments of vulnerability and sincerity. It’s not 
honest to present yourself as the almighty-powerful-know-it-all-who-has-no-flaws.
If you’re genuine,
your humanness will come through. This means sharing
stories that open your listeners’ hearts, sharing
how you’ve failed and how you’ve overcome, and 
letting people in to see that you’re real. Openly 
sharing moments of pain or pleasure endears you 
to the audience through transparency. 


“Being true to yourself involves showing and sharing 
emotion. The spirit that motivates most great storytellers
is ‘I want you to feel what I feel,’ and the effective
narrative is designed to make this happen. That’s how 
the information is bound to the experience and 
rendered unforgettable.” 

Peter Guber 26 

• Be unique: No two people have experienced the exact 
same trials and triumphs in life. During your lifetime, 
you’ve collected stories and feelings that no one else 
has. It’s those differences that make you interesting. 
Though we often tend to conceal our differences in an 
effort to fit in and be accepted, our unique perspective
is what brings new insights to a topic. Share your
ideas and be okay with the fact that sometimes you’re 
the only one who sees what you see. 

• Don’t compromise: If you really believe in what you’re 
communicating, speak confidently about it and don’t 
back down. It’s scary to be ridiculed or rejected, but 
sometimes that’s the price. It won’t be easy to try 
something no one has done before or speak loudly 
about a topic that no one has the guts to confront. Be 
encouraged by the child in the story of “The Emperor's 
New Clothes” who had the guts to say what was really 
going on and, in doing so, shattered the pretenses of 
the entire royal court. Call it like it is. 
You Can Transform Your World


Whether your opportunity to convey your passion 
comes through work or other activities, there will be a 
moment in your life when making an idea clear will play 
a significant role in shaping who you become—and the 
legacy you leave behind. 

Your ideas may be simple, or they could contain the 
keys that unlock unknown mysteries. However, if you 
don’t communicate them well, they will lose their value 
and add nothing to humanity. 

The amount of value you place on your idea should 
be reflected in the amount of care you take in communicating
it. Passion for your idea should drive you to
invest in its communication. 

In this book, we’ve looked at several people whose 
presentations changed the status quo. Their communications
graced the world and made it a better place.
These presenters came from different faiths, different 
walks of life, and different passions; yet all of them 
made the personal investment required to communicate
effectively and change their world. When looking
at the profound impact they’ve all had, it’s easy to tell 
yourself that you won’t measure up to their standards 
because presenting came naturally to them and doesn’t 
come naturally to you. That’s simply not true. 

The people featured in this book invested many hours 
in their presentations and agonized over the words,
structure, and delivery of their ideas. Their presentations
didn’t come easily to them, but they were all committed to 
conveying their perspective in a way that was successfully 
persuasive. Some even risked their lives for their ideas. 

If you aren't inspired by what you do—or if you don’t have 
a message to convey that you’re passionate about—find 
your calling. This book looked at a motivator, a marketer, 
a politician, a conductor, a lecturer, a preacher, an executive,
an activist, and an artist. They all had their own well
of inspirational ideas and their own unique way of communicating
them. You can too. You just need to find what
inspires passion in you. Then you must apply the same 
discipline to communicating it as musicians or dancers 
apply to their craft. 

Nowadays, more than at any other time, people are eager
for inspired ideas that stand out and are worth believing 
in. There’s so much disingenuous noise in our culture that 
when an idea is presented with sincerity and passion, it 
stands out and resonates. 

We were born to create ideas; getting people to feel like 
they have a stake in what we believe is the hard part. 

It doesn’t seem fair that an idea’s worth is judged by how 
well it’s presented, but it happens every day. So, if you
can communicate an idea well, you have, within you, the
power to change the world. 
Case Study: Wolfgang Amadeus Mozart 

Be Flexible Within the Form 


Classical music includes a structure called the sonata form,
which is similar to the presentation form. A sonata has
standard “rules” to follow, yet each sonata sounds unique.
Sonatas don’t come across as contrived or formulaic, and
we can draw inspiration for our presentations from that.

Structure in the Three-Part Sonata Form 

Structure enables listeners to anticipate what comes next. 

The sonata form has three parts: 

(1) Beginning (exposition): Musical themes are introduced
and usually repeated so the listener can identify the
central musical idea. It’s important that the listeners
thoroughly understand the initial theme, so they can
recognize it when it’s modified (creating an identifiable
gap between what is and what could be).

(2) Middle (development): The musical theme is altered
and riffed off of. This is the most exciting part of the
piece, because the listeners are intrigued by how the
composer modifies the central idea. The listeners can
hear the tension between what the theme was in the
beginning and what it has become during the development.
There is an element of surprise.

(3) End (recapitulation): After the ideas are modified in
the development section, the piece transitions back
to the original theme. There is a special feeling when
that theme is restated after its modification during
the development section.

[Figure: Sonata Form. Exposition (A), Development (B), Recapitulation (A)]


Contrast Keeps Things Interesting 

Contrast keeps a presentation interesting. The same 
is true with music. 

(4) Tonal contrast: Put simply, tonal contrast is key changes.
Ben Zander mentions in his presentation (page 48)
that music has a “home” or place of resolution for
which it longs. That home is the tonic key. The beauty
of harmony is that the human ear recognizes when we
are away from home and when we are home.

(5) Dynamic contrast: Dynamic contrast is created when
the music alternates between loud and soft. Sometimes
the transition is sudden, while other times it is gradual.

(6) Textural contrast:

a. Polyphony/Monophony: Throughout the piece there is
always a clear melodic line. Sometimes all the instruments
play the same melody in unison (monophony), and other
times one instrument plays the melody while the others
complement and accompany the melody (polyphony).

b. Density: The number of notes played per measure
determines the density. Sometimes there are only a few
notes per measure, while at other times there are many,
often being played at the same time.

The foundation for an interesting sonata is that it has 
contrast in varying layers, similar to a presentation. Just 
like a great sonata, a great presentation should follow 
the structure of the presentation form yet be flexible 
within its constraint. As the composer of your presentation,
you need to create dramatic contrast to keep the
audience’s interest piqued. 

Each of the numbered 
items in blue circles above is 
represented in the sparkline 
on the following pages. 


[Figure: sparkline key; tonal areas are labeled Tonic, Dominant, and Foreign Keys.]


Sonata Sparkline 


Below is my son’s analysis of the structure and contrast in the first movement 
of Mozart's Eine kleine Nachtmusik. You can see the clear structure of the 
beginning (1), development/middle (2), and recapitulation/end (3). The most 
important contrast in music is the tonal contrast (4). Also notice how extensive 
the other two forms of contrast are, (5) and (6). Contrast is important. 

No two sonatas are alike because great composers know how to work flexibly 
within the form. For your inspiration, there are sonatas visualized and set to 
music on this book's web site, www 


[Figure: sparkline of the first movement, showing Exposition (Beginning) and Exposition (Repeat Beginning), with time markers 0:00, 0:33, 0:50, 1:41, 2:13, 2:31.]



Texture contrast is represented by color and bar height. Yellow represents
musicians playing in unison, blue represents each playing something different,
and green is a blend of the two. The height of the bars represents the
density of the music. Short bars represent fewer notes per measure (typical 
of slow music), and longer bars represent more notes per measure (typical 
of fast music). 
A coda is additional material
that’s played after the recapitulation
has ended. Steve
Jobs’s presentations often
have “codas.” Just when you
think he’s unveiled everything, 
he has an “Oh, wait! There’s 
more!” moment. 
Case Study: Alfred Hitchcock 

Be a Collaborative Visionary 


The presenter is the public persona of a single individual, 
but in reality, the best presentations result from the 
collaborative efforts of an empowered team behind 
the scenes. 

Alfred Hitchcock controlled the central creative aspects 
of his films, but he relied heavily on his team in their creative
development and production. Ideas were written
and drawn before they were filmed. Hitchcock worked 
with a screenwriter to develop a written framework (the 
script). He then worked with a production designer to 
create a visual framework (sketches and storyboards). 2

• Written Framework: To Hitchcock, the real creative 
work on a film took place in the office of the writer. 
“We went into a huddle and slowly from discussions, 
arguments, random suggestions, casual desultory talk, 
and furious intellectual quarrels as to what such and 
such a character in such and such a situation would or 
would not do, the scenario began to take shape.” 3

Without a doubt, Hitchcock brought out the very best 
in his writers. They created absorbing stories, developed
interesting characters, and provided compelling
dialogue. Combined with Hitchcock’s direction, the 
result was a body of work unmatched in the cinema. 4 

• Visual Framework: Hitchcock constantly visualized 
his movies. He began with a story or idea and moved 
quickly to develop a look for the film. Each step in 
the process—costume design, production design, set 
design, visual effects, written scene descriptions, shot 
lists, storyboards, and camera angle drawings—included 
a conversation with the appropriate department heads. 
Hitchcock's collaborators usually took one of the director's
suggestions and expanded upon it, integrating their
ideas into the collective process. Hitchcock envisioned
his films in detail before the camera began to roll. 5 When
interviewed by French film director François Truffaut in
1962, Hitchcock boasted, “I never look at a script while
I’m shooting. I know the whole film by heart.” 6

Actress Janet Leigh described his modus operandi: “In 
his mind, and sketched on the pages of his script, the 
film was already shot. He showed me the model sets 
and moved the miniature camera through the tiny furniture
toward the wee dolls, exactly the way he intended
to do in the ‘reel’ life. Meticulously thorough down to
the minutest detail.” 7 

The process of creating a movie is a highly collaborative 
one in which each person involved brings a layer of 
value. The better we understand the creative process 
behind motion pictures, the better we can understand 
the creative process behind an effective presentation. 

Great leaders honor the people behind the curtain who
help with their presence on stage. Leading requires that
you bring out the best in the team supporting you. Use 
their strengths and talents to build on your ideas. Be 
open to modifying your vision by embracing the unique 
value they bring to the project. 

Even though Hitchcock was the one in the spotlight, he 
let many others influence his work. 


My dad, to whom this book is dedicated, was a 
contributor to Alfred Hitchcock’s Mystery Magazine. 

I’ve posted his short stories online, www 
Case Study: E. E. Cummings 

Break the Rules 


E. E. Cummings was an American poet, painter, essayist, 
author, and playwright. He graduated from Harvard magna 
cum laude and then continued (still at Harvard) to get a 
Master’s degree in English and Classical Studies. He loved 
writing, so to become a better writer, he signed up for 
an advanced composition class where his teacher taught 
him how to make his writing clearer, more precise, and 
less wordy. Cummings practiced writing until his wrist 
hurt. 9 Even though he was considered an avant-garde 
poet, much of his work falls within traditional poetic 
forms. For example, many of his poems are sonnets (but 
with a modern twist), and he occasionally made use of 
the blues form and acrostic poems. 

Cummings knew the right way to write. He didn't break 
the rules until he fully understood them. Cummings 
continually asked himself, “What else can language 
be made to do?” 10 

He combined his love for poetry and art by using the text 
itself as a form. He tore words apart as a way of separating 
letters and the sounds of syllables from their meaning. He 
also stretched out words and used punctuation marks 
and capital letters to add meaning or create visual and 
aural effects. He forced readers to read slowly, to relish 
sounds as they gradually put the words back together 
and discovered what the poems truly meant. 11

The public did not initially enjoy his work—he broke too 
many rules, and his ideas were too far out for the general 
public to consume. For decades, he was reviled by the
poetry community, and he struggled to make ends meet.
As he eagerly submitted poetry to publishers, one after 
another told him, “No thanks.” After fourteen publishers 
turned down his book, he printed it himself. He called it 
No Thanks, and in it he printed the names of the fourteen 
publishers who had rejected him in a list shaped like a 
funeral urn. 12 

It wasn't until he was fifty-six years old that his poetry 
began to get the recognition that it deserved. As his 
career picked up momentum, he traveled and conducted 
readings of his poems to packed auditoriums, becoming 
the best-known poet in the United States. No American 
poet has ever been more playful than Cummings, and 
none have been more skillful at arranging words on 
a page. Many poets have imitated his style, but their 
attempts only prove how difficult it is to master that 
style. 13 He was a true revolutionary. 

It’s important to know the rules so you understand how 
to flex and even break them to create meaning. 

Many of the people who changed the world broke the 
rules and went against standard convention. They stood 
out, were different, and were even reviled at times. 
Sometimes an idea stands out so much it shocks people, 
but that’s what it takes to be noticed. Your idea might initially
be rejected, but take heart: persistence will move
it from rejected to considered and eventually adopted. 
Communicate it until you know you've done everything 
in your power to help your heroes on their journey. 
Sometimes Cummings 
pried a word open with 
a phrase wrapped in 
parentheses to show that 
two events or thoughts 
were occurring at the 
same time. 15 
During World War II, the 
U.S. government rounded 
up Japanese Americans 
on the West Coast—many 
of them U.S. citizens—and 
forced them to live in prison 
camps. Cummings’s outrage 
found expression in poetry 
that mimics intolerance 
based on ignorance. 16 
(If the meaning is unclear, try reading the lines aloud.) 


if there are any heavens my mother will (all by herself) have 
one. It will not be a pansy heaven nor 
a fragile heaven of lilies-of-the-valley but 
it will be a heaven of blackred roses 

my father will be (deep like a rose 
tall like a rose) 

standing near my 

(swaying over her 
silent) 

with eyes which are really petals and see 

nothing with the face of a poet really which 

is a flower and not a face with 

hands 

which whisper 
This is my beloved my 

(suddenly in sunlight 

he will bow, 

& the whole garden will bow ) 8 


Cummings imagines his 
mother and father in 
Heaven. He leaves some 
words unwritten, reminding 
us of the way speech can 
trail off into thought and 
how words unspoken can 
still be understood. 19 
Your ideas are potent. A single idea from the human mind can change the 
world. Mozart, Hitchcock, and Cummings all revolutionized their fields by 
exploring and developing new ideas that had never existed. 

You have the opportunity to shape the future through your imagination. 
Imagining a future where your idea has been implemented will keep you 
inspired to communicate your idea passionately. 

So be flexible, be visionary, and now go rewrite all the rules. 